
Application of multivariate time-series model for high performance computing (HPC) fault prediction

  • Xiangdong Pei,

    Roles Formal analysis, Methodology, Software, Writing – original draft

    Affiliations College of Computer, National University of Defense Technology, Changsha, China, Shanxi Supercomputing Center, Lvliang, China

  • Min Yuan,

    Roles Data curation

    Affiliation Shanxi Supercomputing Center, Lvliang, China

  • Guo Mao,

    Roles Writing – review & editing

    Affiliation College of Computer, National University of Defense Technology, Changsha, China

  • Zhengbin Pang

    Roles Project administration, Software

    zbpang@nudt.edu.cn

    Affiliation College of Computer, National University of Defense Technology, Changsha, China

Abstract

Aiming at the high reliability demands of increasingly large and complex supercomputing systems, this paper proposes a multidimensional fusion CBA-net (CNN-BiLSTM-Attention) fault prediction model that preprocesses and classifies data with HDBSCAN clustering. The model effectively extracts, learns, and fuses the spatial and temporal features in the predecessor fault logs; it is highly sensitive to time-series features and extracts local features thoroughly. Experiments show that the RMSE of the model for fault occurrence time prediction is 0.031, and the prediction accuracy of the node location of fault occurrence is 93% on average. The model achieves fast convergence and improves the fine-grained, accurate fault prediction of large supercomputers.

1. Introduction

In recent years, owing to the increasing demand for high-performance computing (HPC) and the scaling up of supercomputers and intelligent computing systems, the reliability of large-scale computing systems has been investigated extensively [1–4]. System operation is complex, and failures occur frequently and are difficult to detect, locate, diagnose, analyze, and debug [1,5,6]. Existing system health monitoring and checking techniques generally handle faults through different log sources, e.g., for root cause diagnosis and fault detection. However, they still lack the means to proactively handle faults in increasingly complex large-scale supercomputer systems. First, the complexity of supercomputer systems is determined by their novel architectures, continuously updated designs, constantly upgraded applications, and flexible logging mechanisms. Existing fault self-diagnosis techniques are inadequate to cope with these complex changes [6–9]. Owing to the increasing application of artificial intelligence and big data and the rapid development of computing hardware and applications, the operation and maintenance approach has evolved from DevOps (Development Operations) [10,11] to AIOps (Artificial Intelligence for IT Operations) [12–14]. The intelligent operation and maintenance approach can be combined with big data, machine learning, and other technologies to support the operational functions of IT equipment through proactive, personalized, and dynamic insights. AIOps platforms support the simultaneous use of multiple data sources, data acquisition analytics (real-time and deep), and representation technologies. Intelligent operation and maintenance algorithms are emerging technologies that integrate deep learning, time-series data, anomaly detection, and root cause localization in multiple dimensions.

The physical architecture of a supercomputer typically contains a log collection service system where system logs are collected in real-time for feedback. It is a system that allows administrators to understand the system status and fault events whenever necessary. Fault data includes multidimensional attributes of fault events, whereas various attribute elements are highly correlated and are primarily categorized into temporal and spatial correlations.

1.1 Time series-based representation

Temporal correlation in supercomputers refers to the following two aspects: first, specific faults can cause multiple faults on multiple nodes in a short period; second, the same fault can occur multiple times on a node before the root cause is identified and resolved. Spatial autocorrelation refers to the potential interdependence between the observations of variables within the same distribution. System failure prediction aims to predict possible failures that may occur during operation based on the current system state. The fault-prediction task is illustrated in Fig 1.

Time series-based representation: at the current moment t, possible failures are predicted in advance from the observed system state by monitoring the system over a data window of length Δtd. The advance is called the lead time Δtl. The length Δtp represents the validity of the prediction, also known as the prediction period. Increasing Δtp increases the probability that a failure is correctly predicted. Δtw is the minimum warning period, i.e., the minimum time needed to take preventive measures. If the lead time Δtl is shorter than the minimum warning period Δtw, preventive measures cannot be taken in time.
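The timing constraint above (a prediction is actionable only when Δtl ≥ Δtw) can be sketched as a small predicate; the class name and the example values are illustrative, not from the paper:

```python
from dataclasses import dataclass

@dataclass
class PredictionWindow:
    """Timing parameters of one failure prediction (all in seconds).
    Names mirror the text: lead_time (dt_l), warning_period (dt_w),
    prediction_period (dt_p); the values used below are illustrative."""
    lead_time: float          # dt_l: how far ahead the failure is predicted
    warning_period: float     # dt_w: minimum time needed for preventive action
    prediction_period: float  # dt_p: interval in which the prediction is valid

    def actionable(self) -> bool:
        # Preventive measures can only be taken when dt_l >= dt_w.
        return self.lead_time >= self.warning_period

# A 2-minute lead on a 1-minute warning period leaves time to react.
p = PredictionWindow(lead_time=120.0, warning_period=60.0, prediction_period=300.0)
```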

1.2 Spatial feature-based representation

Spatial correlation has two characteristics. First, inevitable failures can occur (almost) synchronously in the same subsystem or in multiple nodes at the boundary of the subsystem, such as failures in high-speed interconnects and file storage. Second, errors occurring in one node can trigger other errors in different nodes [15]. The research object of this work is a supercomputer deployed at the Shanxi Provincial Supercomputing Center, which has the following characteristics: each computer rack contains four computing frames, where each frame contains 32 computing nodes, a switching board, and a display board, all connected through a backplane. The system comprises 16 computer racks with 2048 computing nodes in total. The supercomputer is deployed with a visualization system to view failures in real time, as shown in Fig 2(a), where the red highlighted region provides real-time warnings about excessively high temperatures and memory overflow. Thus, the research object of this work exhibits both temporal and spatial attributes (Fig 2(b)), and for the convenience of vectorized data processing, it can be abstracted as data cubes. The reliability of computational systems has hitherto been improved to a certain extent [11,12]. However, fault prediction methods based on conventional machine learning and neural networks are applicable only to a limited range of computational platforms, and their localization and prediction accuracy require further improvement. The contributions of this work are as follows:

  1. A fault log preprocessing mechanism based on HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) clustering is introduced to extract multivariate feature information from the low-dimensional space of fault logs.
  2. Based on intelligent operation and maintenance, a fault prediction model of multivariate time-series (CNN–BiLSTM–attention) is proposed, which processes classification data based on HDBSCAN clustering and affords rapid convergence to improve prediction accuracy.
  3. The proposed multidimensional model can effectively extract and fuse spatial and temporal features in fault logs, and it is highly sensitive to time-series features. Besides, it can extract local features more effectively than conventional machine learning methods. Experiments show that the model yields more effective fault prediction of large computing systems to support decision-making and system management.
Fig 2. Supercomputer system fault monitoring view and time- and space-based fault log data cubes.

https://doi.org/10.1371/journal.pone.0281519.g002

The rest of the article is organized as follows. Section 2 describes our motivation for conducting this study. Section 3 reviews the related research work. Section 4 provides a detailed description of the overall approach of the multivariate time-series model. Section 5 describes the experimental procedure. The conclusion and discussion are presented in Section 6.

2. Motivation

With the increasing scale of supercomputer systems, the types of faults in supercomputers are becoming more complex. Traditional unitary fault-tolerance strategies, such as system checkpointing, are difficult to adapt to complex system failures [8]. Identifying the intrinsic fault-association characteristics of the system through statistical laws can support fault prediction and lightweight preprocessing, which is a key way to achieve active fault tolerance in supercomputers [16]. To discover the failure laws of major computing components from fault log data, and to address the quantitative description of their failure times [6], the failure data of supercomputers are analyzed along the time and space dimensions, and a multidimensional unified failure time model adapted to supercomputers is established. Through the synergistic analysis of applications and failures, the impact of different applications on system failures is discovered, which supports the development of targeted fault-tolerance strategies. Existing research has improved the reliability of computing systems to a certain extent; however, fault prediction methods based on traditional machine learning and neural networks target only specific classes of computing platforms, and their localization and prediction accuracy need further improvement. In carrying out fault prediction for large-scale supercomputing systems, we introduce a fault log preprocessing mechanism based on HDBSCAN clustering and first extract multivariate feature information from the low-dimensional space of fault logs. We then construct a multidimensional fusion network prediction model that effectively learns and fuses the spatial and temporal features in fault logs; compared with traditional machine learning methods, it is highly sensitive to time-series features and extracts local features adequately.

3. Related studies

Fault prediction is vital to supercomputing system reliability research. Therefore, the fault tolerance of supercomputing systems has been investigated extensively [7,17]. Present research primarily focuses on identifying fault sources and developing the corresponding prediction mechanisms [18]. Das et al. proposed a machine learning method that uses long short-term memory networks to predict node failures with a three-minute lead time, 85% recall, and 83% accuracy [1]. Frank et al. combined multiple, independently trained neural networks using different lead-up time offsets with simple majority voting, where a consensus among the neural networks is required to issue a positive (failure) final prediction [8]. Shetty et al. used the XGBoost classifier for failure class prediction based on task failure features on the Google cluster dataset and achieved high prediction accuracy [19]. Gainaru et al. proposed a signal-based fault prediction method that identifies regular events in system logs as signal data and employed algorithms to mine progressive association rules to calculate the temporal relationships between events; favorable results were ultimately obtained [20]. Fujitsu Laboratories developed a fault prediction technique based on message-pattern learning that creates and learns message patterns in real time, and evaluated its performance by obtaining messages online for experimental fault prediction in a real cloud data center; however, its prediction success rate needs improvement [21]. Moreover, these methods require overly complex feature extraction, and the models cannot easily be adapted to the scale of the system. Currently, machine learning is widely used to extract features from log data [22]. Ju et al. applied the attention mechanism to LSTM, enabling LSTM to screen multiple sequences, remove irrelevant redundant information, and capture information about interactions between sequences [23]. Chen et al. used an RNN to predict the probability of job failure from a task; despite their low accuracy, the prediction results enabled the conservation of system resources [24]. Zhu et al. employed support vector machine and neural network methods to predict hard-disk failures [25]. Nie et al. analyzed the correlation among temperature, power, and errors on GPUs, proposed a neural network-based prediction method, and predicted failures for four cabinets of the TITAN supercomputer with 82% accuracy [26]. Islam et al. proposed the use of LSTM for prediction [27], which was not completely accurate but facilitated the conservation of system resources [28]. Although these methods solve problems pertaining to feature extraction, they cannot reveal the dependencies between faults, and none of the prediction results is satisfactory. Table 1 summarizes related studies on traditional and deep learning methods for fault prediction in large-scale complex computing systems.

Table 1. Related research of traditional methods and deep learning methods in fault prediction of large-scale complex computing systems.

https://doi.org/10.1371/journal.pone.0281519.t001

4. Multivariate time-series model

In this section, the architecture integrating HDBSCAN, CNN, BiLSTM, and attention is explained. HDBSCAN is used to cluster data with different faults and to preprocess the fault data. The first layer of the CBA network model is the CNN layer, whose main role is to extract the local temporal and spatial features of the fault logs; BiLSTM maintains multivariate time-series fault features while predicting the next state; finally, the attention mechanism enhances the features with a high impact on the results to further improve the accuracy of fault prediction.

4.1 Data preprocessing based on HDBSCAN

Data clustering is a process of arranging similar data into groups based on certain characteristics and properties, where each group is considered a cluster [23]. Because supercomputing system failures have many causes, such as hardware, software, and operational failures, numerous categories of failure logs exist. For noisy log data, the present research primarily applies the HDBSCAN [29] algorithm to process the fault logs and combines data with similar characteristics to obtain more accurate prediction results [30].

The HDBSCAN clustering algorithm is an improved version of the density-based clustering algorithm DBSCAN [31]; it combines the DBSCAN algorithm with hierarchical clustering. The DBSCAN clustering algorithm yields better results than other clustering algorithms on anomalous datasets [32]. However, it can only cluster data with the same density distribution, and the clustering process requires tuning two parameters, Minpts (the minimum number of points) and Eps (the neighborhood radius), which restricts the use of the DBSCAN clustering algorithm. Hence, hierarchical clustering is introduced into the HDBSCAN clustering algorithm, where the distance between two points is redefined as

d_mreach-k(a, b) = max{core_k(a), core_k(b), d(a, b)},  (1)

where d_mreach-k(a, b) refers to the mutual reachability distance between two points a and b, core_k(·) is the core distance (the distance to a point's k-th nearest neighbor), and d(a, b) is the Euclidean distance between a and b. The clustering algorithm uses the minimum spanning tree to construct the hierarchical tree model between points, which implies that only the minimum cluster size (min_cluster_size) needs to be defined in the algorithm to obtain the optimal clustering results. Therefore, complicated tuning can be avoided, and the clustering accuracy and applicability can be improved. Alg 1 shows the pseudocode of HDBSCAN. The main steps of HDBSCAN are as follows: transforming the space → building the minimum spanning tree → building the cluster hierarchy → condensing the cluster tree → extracting the clusters.
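The first two steps, transforming distances via Eq (1) and building the minimum spanning tree, can be sketched with numpy and scipy; the toy points and the choice k = 2 are illustrative, not the paper's data or settings:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist

def mutual_reachability(X, k):
    """Mutual reachability distance matrix of Eq (1):
    d_mreach-k(a, b) = max{core_k(a), core_k(b), d(a, b)}."""
    d = cdist(X, X)                  # pairwise Euclidean distances
    # core_k(a): distance from a to its k-th nearest neighbour
    core = np.sort(d, axis=1)[:, k]  # column 0 is the point itself
    return np.maximum(d, np.maximum(core[:, None], core[None, :]))

# Toy fault-feature vectors: two well-separated groups.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
D = mutual_reachability(X, k=2)

# The minimum spanning tree over D is the backbone of HDBSCAN's hierarchy.
Dm = D.copy()
np.fill_diagonal(Dm, 0)              # drop self-loops before building the MST
mst = minimum_spanning_tree(Dm)
```

HDBSCAN then derives the cluster hierarchy by removing MST edges in decreasing weight order and condensing the resulting tree.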

Algorithm 1. HDBSCAN clustering pseudocode.

Input: Location data LD; parameters Eps and Minpts; S-Tree height Height

Output: LD with cluster labels; the Spatial_Tree is built

1. DBSCAN_OBJECT Root = Joint(LD, Eps, Minpts);       // root node of the tree
2. enqueue(Q, Root);                                  // push the DBSCAN object into the queue
3. front = 0; last = 0; level = 0;
4. while (Q <> empty and front <= last) do
5.   DBSCAN_OBJECT node = dequeue(Q);                 // pull data from the queue
6.   front++;
7.   Data_OBJECT Children = DBSCAN.getCluster(node);  // call DBSCAN
8.   if (level > Height)
9.     break;
10.  for i FROM 1 TO Children.size do
11.    Data child = Children.get(i);
12.    DBSCAN_OBJECT obj = Joint(child, Eps, Minpts);
13.    enqueue(Q, obj);
14.  end for
15.  if (front > last)                                // all members in one level have been searched
16.    last = Q.size() + front - 1;
17.    level++;
18.  end if
19. end while
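The level-order traversal of Algorithm 1 can be sketched in Python. As a minimal sketch, the `split` function below is a stand-in for the per-node DBSCAN call (`DBSCAN.getCluster`); the threshold rule and toy data are hypothetical:

```python
from collections import deque

def hierarchical_cluster(data, split, height):
    """BFS hierarchical clustering mirroring Algorithm 1.
    `split(points)` stands in for the DBSCAN call: it returns a list of
    sub-clusters (lists of points). Each point's label is its path from
    the root of the cluster tree."""
    labels = {}
    queue = deque([(tuple(data), ())])       # (points, path-in-tree)
    while queue:
        points, path = queue.popleft()
        if len(path) >= height:              # depth limit (Alg 1, line 8)
            continue
        for i, child in enumerate(split(list(points))):
            for p in child:
                labels[p] = path + (i,)      # label = path from the root
            if len(child) > 1:               # recurse only into splittable groups
                queue.append((tuple(child), path + (i,)))
    return labels

# Toy split rule standing in for DBSCAN: separate values below/above 10.
def split(points):
    low = [p for p in points if p < 10]
    high = [p for p in points if p >= 10]
    return [c for c in (low, high) if c]

labels = hierarchical_cluster([1, 2, 3, 11, 12], split, height=1)
```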

4.2 Methodology

4.2.1 Convolutional neural networks.

The first layer of the model is a CNN (convolutional neural network) layer, whose primary role is to extract the local features of the fault logs. Fig 3 shows the structure of the convolutional neural network.

The extraction of fault feature information from a time series by a 1D CNN is primarily performed by filters in the convolutional layer, which contain numerous kernels. Each kernel covers an acceptable field of log information, and each layer is passed through a rectified linear unit (ReLU) activation function:

f(x) = max(0, x).  (2)

After the activation function suppresses negative values and mitigates the gradient vanishing and gradient explosion problems, feature mapping is performed by the filter as

y_n^m = f(W_n^m * x + b_n^m),  (3)

where y_n^m is the output of the nth filter in convolutional layer m, f is the activation function, W_n^m is the weight of the convolutional kernel, b_n^m is the bias, * denotes convolution, and x is the input feature vector. After the convolutional layer, the features are reduced in dimension through the max-pooling layer to compress the data and decrease the number of parameters to prevent overfitting.
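The conv → ReLU → max-pool pipeline above can be sketched in plain numpy (the paper uses TensorFlow; the kernel, bias, and input sequence here are illustrative, not trained values):

```python
import numpy as np

def relu(x):
    # Eq (2): f(x) = max(0, x)
    return np.maximum(0.0, x)

def conv1d(x, w, b):
    """Valid 1-D convolution (ML convention, no kernel flip) of Eq (3):
    y = f(w * x + b). Weights here are illustrative."""
    n = len(x) - len(w) + 1
    y = np.array([np.dot(x[i:i + len(w)], w) + b for i in range(n)])
    return relu(y)

def max_pool(y, size=2):
    # Non-overlapping max pooling compresses the feature map.
    trimmed = y[: len(y) // size * size]
    return trimmed.reshape(-1, size).max(axis=1)

x = np.array([0.0, 1.0, -1.0, 2.0, 0.5, -0.5])  # toy log-feature sequence
w = np.array([1.0, -1.0])                       # toy difference kernel
out = max_pool(conv1d(x, w, b=0.0))
```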

4.2.2 BiLSTM prediction network.

Considering the time-dynamic nature of supercomputer systems, conventional supervised learning methods such as logistic regression, support vector machines, and tree-based classifiers treat input sequences as independent features and cannot capture the temporal dependence between them. In this study, recurrent neural networks (RNNs) were applied to the system to overcome the disadvantages of conventional learning methods. Nonetheless, classical RNNs lack the ability to store previous input information for a long duration, which weakens their ability to model the long-range structure of input sequences. LSTM (long short-term memory) is an RNN architecture that aims to improve the ability of RNNs to store and access information [33]. In this work, an LSTM-based prediction network was applied to model the dynamic properties of computer systems (Fig 4), based on the significant time dependence of fault prediction in the computational systems described above.

Since fault log information is time-based serial information, temporal characteristics are critical for predicting faults. LSTM is an improved version of the RNN model [33], which solves the problem of gradient explosion and gradient disappearance in the RNN model to a great extent [34]. LSTM introduces a set of storage units and allows historical information to be forgotten at a one-time node during the training and update of the storage units, thus, it is more conducive to processing information over longer distances and is beneficial for managing time-sensitive data [23]. The structure diagram of LSTM is shown in Fig 5.

As shown in Fig 5, the LSTM cell comprises four critical components: the internal memory, the forget gate, the input gate, and the output gate. First, the forget gate f_t determines how much of the previous cell state C_{t−1} is retained given the current input x_t. Subsequently, the input gate i_t, which determines the information to be retained from the input, is calculated, and the candidate cell state a_t is formed. Next, the current cell state C_t is calculated. Finally, the output gate o_t and the hidden state h_t are calculated:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
a_t = tanh(W_a · [h_{t−1}, x_t] + b_a)
C_t = f_t ⊙ C_{t−1} + i_t ⊙ a_t        (4)
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
h_t = o_t ⊙ tanh(C_t)

where σ denotes the sigmoid function and b_f, b_i, b_o, and b_a denote the biases. The constructed model utilizes a BiLSTM recurrent network layer, which allows time-series features to be learned in both the forward and backward directions and is more conducive to feature extraction [35].
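A single LSTM step implementing the gate equations in (4) can be sketched in numpy; the dimensions and random weights below are illustrative, not the trained model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    """One LSTM step per Eq (4). W maps the concatenated [h_prev, x_t]
    to the four gate pre-activations; weights here are illustrative."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ z + b["f"])  # forget gate
    i = sigmoid(W["i"] @ z + b["i"])  # input gate
    a = np.tanh(W["a"] @ z + b["a"])  # candidate cell state
    c = f * c_prev + i * a            # new cell state
    o = sigmoid(W["o"] @ z + b["o"])  # output gate
    h = o * np.tanh(c)                # new hidden state
    return h, c

# Toy dimensions: hidden size 2, input size 3.
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((2, 5)) * 0.1 for k in "fiao"}
b = {k: np.zeros(2) for k in "fiao"}
h, c = lstm_cell(rng.standard_normal(3), np.zeros(2), np.zeros(2), W, b)
```

A BiLSTM simply runs one such cell over the sequence forward and another backward, concatenating the two hidden states at each step.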

4.2.3 Attention mechanism.

The attention mechanism recognizes crucial information by enhancing focus [36]; it disregards unimportant information and concentrates on vital information. A structural model based on the attention mechanism can record the positional relationships between pieces of information and measure the importance of specific information features through information weights. Dynamic weight parameters are determined by distinguishing relevant from irrelevant information features to strengthen critical information and weaken ineffective information, thereby increasing the efficiency of the deep learning algorithm and mitigating some defects of conventional deep learning. Let K_t denote the output processed by the CNN and BiLSTM models. A score s_t is calculated from K_t to decide its level of influence on the output value. Subsequently, the softmax function normalizes s_t to obtain the attention weights a_t. Finally, the weight coefficients and the input vector K_t are used to calculate the weighted features:

s_t = tanh(W_h K_t + b_h)
a_t = softmax(s_t)        (5)
c = Σ_t a_t K_t

where W_h and b_h refer to the weight and bias, respectively.
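The scoring-then-weighting scheme of Eq (5) can be sketched as attention pooling over a toy BiLSTM output; the scoring weights are illustrative, not trained values:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())      # numerically stable softmax
    return e / e.sum()

def attention_pool(K, W_h, b_h):
    """Attention pooling per Eq (5): score each time step, normalize
    with softmax, and return the weighted sum of the features."""
    s = np.tanh(K @ W_h + b_h)   # one score per time step
    a = softmax(s)               # attention weights, sum to 1
    return a @ K, a              # context vector and the weights

T, d = 4, 3                      # toy sequence: 4 time steps, 3 features
rng = np.random.default_rng(1)
K = rng.standard_normal((T, d))  # stand-in for the BiLSTM output
context, weights = attention_pool(K, rng.standard_normal(d) * 0.1, 0.0)
```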

4.3 Model framework

The apparent features in the fault log data of the supercomputing system are first discovered by the CNN to extract fault features. Subsequently, the CNN output is fed into the BiLSTM, whose forward and backward passes extract fault features along the time series. Finally, the features with a more significant impact on the results are retrieved by the attention mechanism to enhance the accuracy of fault prediction. The specific structure of the model is illustrated in Fig 6.

After the data preprocessing and clustering operations, the non-numerical components of the data are encoded via the sklearn preprocessing utilities. Subsequently, the data are normalized and transformed into a supervised-learning format.
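The normalization and supervised-learning transform can be sketched in numpy; the window length and toy series below are illustrative, not the paper's configuration:

```python
import numpy as np

def min_max(series):
    # Min-max normalization to [0, 1], as commonly done before LSTM training.
    lo, hi = series.min(), series.max()
    return (series - lo) / (hi - lo)

def to_supervised(series, window):
    """Frame a time series for supervised learning: each input row holds
    `window` consecutive values and the target is the next value."""
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

s = min_max(np.array([3.0, 7.0, 5.0, 9.0, 11.0, 4.0]))  # toy interval series
X, y = to_supervised(s, window=2)
```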

5. Experiments

This section introduces the experiments in detail and is divided into three parts. The first part presents the dataset and the model evaluation indicators. The second part introduces the experimental parameter settings of the comparison experiments. The third part describes the experimental results, visually displaying the prediction results of the multivariate time-series model and the comparison with the baseline models.

5.1 Dataset description and evaluation indicators

5.1.1 Dataset description.

The fault logs of the Shanxi Supercomputing Center collected from 2016 to 2018, comprising 8,718,121 entries, were employed as the experimental data [37]. Each system log record contains 26 fields, of which 16 are NULL and are removed as invalid. The following ten fields are retained: the record ID; the time at which the log recorded the fault, ReceivedAt; the time at which the failure first occurred, DeviceReportedTime; the failing device name, Facility; the failure level, Priority; the failing node number, FromHost; the failure message, Message; the failure number, InfoUnitID; the failure log tag, SysLogTag; and the check code, checksum.

Since ReceivedAt is the time recorded after the fault is "sensed" by the logging system, it cannot be used as the actual time at which the fault occurred. Therefore, DeviceReportedTime is recorded as the occurrence time of the fault and converted to date form, whereas the ReceivedAt and ID fields are deleted. Because the time of failure is uncertain, predicting the time of failure can be regarded as predicting the lead time of failure; that is, the interval between adjacent failures is calculated as the time interval, and failure log information with nine fields is obtained. The specific log information is presented in Fig 7.
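Computing inter-failure intervals from occurrence timestamps can be sketched with the standard library; the timestamps below are illustrative, not real log entries:

```python
from datetime import datetime

def failure_intervals(timestamps):
    """Turn a list of DeviceReportedTime-style timestamps into the
    intervals (in seconds) between consecutive failures, which serve
    as the prediction target described above."""
    ts = sorted(datetime.fromisoformat(t) for t in timestamps)
    return [(b - a).total_seconds() for a, b in zip(ts, ts[1:])]

intervals = failure_intervals([
    "2017-03-01 10:00:00",
    "2017-03-01 10:00:45",
    "2017-03-01 10:12:00",
])
```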

The fault data were first analyzed in general, and the number of failures of each node was counted; the results are shown in Fig 8a. Regarding the spatial distribution of system failures, each computer rack contains four computing frames, where each frame contains 32 computing nodes connected through the backplane. Based on this spatial relationship, the spatial probability density diagram of the frame in which each failed node is located can be obtained, as shown in Fig 8b. The results show that the first 15 frames present higher risks of failure, which is related to the intensity of their tasks. Because failures occur at uncertain times, only the interval between two failures could be predicted. The characteristics of the failure time distribution were obtained via analysis, and the results are shown in Fig 8c. Most of the failure intervals were short, indicating that the same failure might have occurred frequently.

Fig 8. Calculation of number of node failures and their spatial and temporal probability densities.

(a) Number of failures per node; (b) spatial probability map of failures; (c) temporal probability map of faults.

https://doi.org/10.1371/journal.pone.0281519.g008

5.1.2 Evaluation indicators.

The two objectives of predicting the fault occurrence time and the fault occurrence node were assessed using the mean absolute error (MAE) and the root mean square error (RMSE) as statistical performance metrics [38]:

MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|  (6)

RMSE = sqrt((1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²)  (7)

where y_i is the observed value and ŷ_i is the predicted value.
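The two metrics of Eqs (6) and (7) can be computed directly in numpy; the toy predictions below are for illustration only:

```python
import numpy as np

def mae(y_true, y_pred):
    # Eq (6): mean absolute error
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    # Eq (7): root mean square error
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true = np.array([1.0, 2.0, 3.0, 4.0])  # toy observed values
y_pred = np.array([1.1, 1.9, 3.2, 3.8])  # toy predicted values
err_mae, err_rmse = mae(y_true, y_pred), rmse(y_true, y_pred)
```

RMSE penalizes large deviations more heavily than MAE, so RMSE ≥ MAE always holds.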

The confusion matrix [39], also called the error matrix, is a standard format for representing accuracy evaluation in the form of an n-by-n matrix. The four combinations of true and predicted values are: true positive (TP), where the true category of the sample is positive and the model predicts it to be positive; true negative (TN), where the true category is negative and the model predicts it to be negative; false positive (FP), where the true category is negative but the model predicts it to be positive; and false negative (FN), where the true category is positive but the model predicts it to be negative [8].

Accuracy = (TP + TN) / (TP + TN + FP + FN)  (8)
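The confusion-matrix tally and the accuracy of Eq (8) can be sketched as follows; the binary labels (1 = failure, 0 = normal) are illustrative:

```python
def confusion_counts(y_true, y_pred):
    """Tally TP/TN/FP/FN for binary labels (1 = failure, 0 = normal)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    # Eq (8): fraction of correctly classified samples
    return (tp + tn) / (tp + tn + fp + fn)

tp, tn, fp, fn = confusion_counts([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```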

5.2 Parameter configuration

For the experiments, an Intel(R) Core(TM) i7-10750H CPU @ 2.60 GHz (12 logical CPUs) with 16 GB of RAM, Windows 11 64-bit, and an NVIDIA GeForce RTX 3060 were used, together with Python 3.9, TensorFlow 2.6, and scikit-learn. The deep learning model was trained with Adam as the optimizer, the MAE as the loss function, and 50 epochs. Of the data, 80% and 20% were allocated as training and prediction data, respectively. The same environment was used for all models. The model constructed in this paper predicts both fault time and location; the prediction model parameters are set as shown in Table 2, the batch size is 72 (i.e., 72 samples are input to the model at a time), and the activation function is tanh. The training loss for time prediction reaches below 0.001, and the training loss for node prediction reaches 0.01.

Table 2. Hyperparameter settings for fault node location prediction model.

https://doi.org/10.1371/journal.pone.0281519.t002

5.3 Experimental results and analysis

5.3.1 Clustering results.

HDBSCAN clustering was performed on the preprocessed fault logs, and the clustering results revealed five types of fault characteristics, namely Cluster0, Cluster1, Cluster2, Cluster3, and Cluster4; their distribution is shown in Fig 9a. The distribution chart shows that Cluster2, Cluster0, and Cluster4 constitute 37.45%, 20.04%, and 10.99%, respectively. Based on the priority of fault occurrence, faults can be classified into six levels, Priority0 through Priority5, where Priority0, Priority1, and Priority2 each contain fewer than 2000 messages (49, 1004, and 1161 messages, respectively). Because of their low message counts, these fault priorities were uniformly classified as "others", and the six priorities were divided into the following four levels: "other", Priority3, Priority4, and Priority5. As shown in Fig 9b, the higher the priority, the higher the occurrence probability and the lower the failure severity. Based on the analysis of each clustering category, as shown in Fig 9c, Cluster0 primarily contains the fault priorities "other", Priority3, and Priority4, indicating that the data in Cluster0 are relatively severe faults. The faults in Cluster1, Cluster2, and Cluster3 are of lower severity owing to the Priority5 data: most of the data in Cluster1 and Cluster3 are Priority5 data, whose degree of failure is the lowest, whereas Cluster2 also contains some Priority3 failures. The failure data of Cluster4 are more complex than those of the other clusters, with all distribution degrees represented fairly evenly; however, Priority4 has the highest share, indicating that the data in Cluster4 exhibit an intermediate degree of failure. As shown in Fig 9d, the fault logs for Facility1 and Facility6 contain only 51 and 54 messages, respectively, indicating that these two devices do not fail frequently. Among Facility0, Facility2, Facility3, Facility4, and Facility5, Facility5 is the most prone to failure. In addition, the present research compared the locations of the failed devices in each cluster. As shown in Fig 9e, the failures of Cluster0 occurred primarily in Facility0, whereas those of Cluster1 and Cluster3 occurred primarily in Facility3 and Facility4, respectively. Meanwhile, the failures of Cluster2 occurred in Facility4. Among the clusters, Cluster4 was more complicated, and its fault locations were randomly distributed.

Fig 9. HDBSCAN clustering result map.

(a) HDBSCAN clustering result graph, (b) Fault priority distribution diagram, (c) Fault priority distribution diagram of each cluster, (d) Data fault device distribution diagram, and (e) Fault device distribution of each cluster.

https://doi.org/10.1371/journal.pone.0281519.g009

In summary, the fault data were preprocessed by HDBSCAN to categorize the fault category, severity, occurrence location, and susceptibility factors, enabling more accurate future predictions.

5.3.2 Fault time prediction.

The predictions of fault time from the overall data and from the data of each cluster are shown in Fig 10. The clustered data (b, c, d, e, and f) yielded a more significant effect than the overall data (a) in predicting fault time. When the model was trained on the clustered preprocessed data, the prediction results fitted the training data more closely.

Fig 10. Fault node time prediction training/validation loss.

Training/validation losses of (a) all data, (b) Cluster0, (c) Cluster1, (d) Cluster2, (e) Cluster3, and (f) Cluster4.

https://doi.org/10.1371/journal.pone.0281519.g010

The model's prediction results for the comprehensive data and for each cluster category are presented in Table 3. The MAE values for Cluster3 and Cluster0 were 0.011 and 0.249, respectively, and their RMSE values were 0.135 and 2.199. The variation in the MAE and RMSE values is positively correlated with the complexity of each cluster's data composition, indicating that the model possesses good generalization and prediction abilities.
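For reference, the MAE and RMSE metrics reported in Table 3 can be computed directly from predicted and true values. The toy targets below are invented for illustration and do not come from the paper's dataset:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the prediction error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more than MAE."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical normalized fault-time targets vs. model outputs
y_true = np.array([0.10, 0.25, 0.40, 0.55])
y_pred = np.array([0.12, 0.24, 0.43, 0.51])
errors = (mae(y_true, y_pred), rmse(y_true, y_pred))
```

Because RMSE squares each residual before averaging, a cluster whose data composition is more heterogeneous (and thus produces occasional large errors) shows a disproportionately larger RMSE than MAE, which matches the spread seen between Cluster3 and Cluster0 in Table 3.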

Table 3. Evaluation metrics of model for various types of downtime prediction.

https://doi.org/10.1371/journal.pone.0281519.t003

5.3.3 Fault location prediction.

To predict the location of faulty nodes in a system comprising 2048 nodes, the precise ID and location of each node must be determined. The model's predictions of faulty-node locations are illustrated in Fig 11; as in the fault-time experiments, comparisons were run on both the overall data and the data of each cluster. Fig 11 shows that the predictions obtained from the clustered data locate faulty nodes more accurately. As indicated in Table 3, clustering with HDBSCAN yielded better predictions: the MAE values of Cluster2 and Cluster4 were 1.49 and 6.60, respectively, and the RMSE values of Cluster2 and Cluster0 were 1.49 and 9.55, respectively. The changes in the MAE and RMSE values were positively correlated with the data-composition complexity of the clusters, showing that the model generalizes well when predicting the location of faulty nodes.

Fig 11. Fault node location prediction training/validation loss.

Training/validation losses for (a) all data, (b) Cluster0, (c) Cluster1, (d) Cluster2, (e) Cluster3, (f) and Cluster4.

https://doi.org/10.1371/journal.pone.0281519.g011

5.3.4 Comparison of models.

To evaluate the predictive power of the multidimensional time-series model, experiments were conducted using data collected from the complex Cluster1 to assess the accuracy of the model's faulty-node-location predictions. Training used a batch size of 256 and 50 epochs, with all other variables held constant. The experimental results are shown in Table 4: the prediction accuracy of the proposed model outperforms SVR, XGBoost, LSTM, and other methods. This is attributed to the HDBSCAN clustering preprocessing and the fusion mechanism of the CBA network model, in which the CNN-BiLSTM component mines the temporal and spatial features in the fault logs while the attention mechanism efficiently weights the most informative features.
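A minimal sketch of such a CNN-BiLSTM-Attention fusion is shown below, written in PyTorch with hypothetical layer sizes (the section does not publish the exact architecture, so the channel/hidden dimensions and the `CBANet` name are assumptions for illustration):

```python
import torch
import torch.nn as nn

class CBANet(nn.Module):
    """Sketch of a CNN-BiLSTM-Attention regressor (hypothetical sizes)."""
    def __init__(self, n_features=8, conv_channels=32, lstm_hidden=64):
        super().__init__()
        # CNN: extracts local patterns along the time axis
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, conv_channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # BiLSTM: models long-range temporal dependencies in both directions
        self.bilstm = nn.LSTM(conv_channels, lstm_hidden,
                              batch_first=True, bidirectional=True)
        # Attention: scores each time step, then forms a weighted summary
        self.attn = nn.Linear(2 * lstm_hidden, 1)
        self.head = nn.Linear(2 * lstm_hidden, 1)

    def forward(self, x):                      # x: (batch, time, features)
        h = self.conv(x.transpose(1, 2))       # -> (batch, channels, time)
        h, _ = self.bilstm(h.transpose(1, 2))  # -> (batch, time, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)   # attention weights over time
        ctx = (w * h).sum(dim=1)               # context vector per sample
        return self.head(ctx).squeeze(-1)      # one scalar prediction each

model = CBANet()
y = model(torch.randn(4, 16, 8))  # 4 windows of 16 steps x 8 features
```

The fusion order mirrors the description above: convolution captures local log-feature patterns, the bidirectional LSTM propagates them across the time window, and the attention weights let the final regression head focus on the most informative time steps.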

Table 4. Performance of CNN–BiLSTM–attention model in comparison with those of other models.

https://doi.org/10.1371/journal.pone.0281519.t004

Table 5 compares the performance of the CNN–BiLSTM–attention model with those of other models on the fault log data. Our proposed model is compared with five fault prediction models, and two ablation experiments are conducted; the results show that the proposed multidimensional time-series model achieves finer granularity (time and location) and higher prediction accuracy in fault prediction for supercomputing systems.

Table 5. Evaluation metrics for model prediction of various fault node types.

https://doi.org/10.1371/journal.pone.0281519.t005

6. Results and discussion

In this paper, we propose a data preprocessing method based on HDBSCAN clustering to classify faults, and then build a multidimensional CNN-BiLSTM-Attention network model to train on the preprocessed data. The multidimensional model can effectively extract and fuse the spatial and temporal features in fault logs; compared with traditional machine learning methods, it is highly sensitive to time-series features and extracts local features thoroughly. The average prediction accuracy exceeds 93%. Although the proposed method can serve as a reference for reliability research on supercomputing and intelligent computing systems, and good experimental results have been achieved in practical prediction, the prediction system is based on historical data: it responds insufficiently to real-time fault data and incurs large computational and bandwidth overheads. Meanwhile, the preprocessed fault data can be used not only for fault analysis and prediction but also for fault-tolerant recovery of the system. In future research, we will first improve the speed of data acquisition and preprocessing, optimize the fault analysis and prediction mechanism, and apply that mechanism to fault-tolerant recovery of the system; the granularity and accuracy of fault prediction classification will be further improved to reduce the growing node computation and network overhead incurred while the prediction model runs. Second, the scope of prediction can be extended to energy efficiency, a challenge that matters to supercomputing providers seeking to minimize costs. In addition, the application of transfer learning techniques can be explored to provide a useful reference for fault-tolerant frameworks for supercomputing systems.

References

  1. Das A, Mueller F, Siegel C, Vishnu A. Desh: deep learning for system health prediction of lead times to failure in HPC. Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing. New York, NY, USA: Association for Computing Machinery; 2018. pp. 40–51.
  2. Roman E, Das A, Mueller F, Hargrove PH. Pin-pointing Node Failures in HPC Systems. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); 2020 Mar. https://www.osti.gov/biblio/1605274.
  3. Molan M, Borghesi A, Beneventi F, Guarrasi M, Bartolini A. An Explainable Model for Fault Detection in HPC Systems. In: Jagode H, Anzt H, Ltaief H, Luszczek P, editors. High Performance Computing. Cham: Springer International Publishing; 2021. pp. 378–391.
  4. Mao G, Zeng R, Peng J, Zuo K, Pang Z, Liu J. Reconstructing gene regulatory networks of biological function using differential equations of multilayer perceptrons. BMC Bioinformatics. 2022;23: 503. pmid:36434499
  5. Zhu L, Gu J, Wang Y, Zhao T, Cai Z. Optimizing the fault-tolerance overheads of HPC systems using prediction and multiple proactive actions. J Supercomput. 2015;71: 3668–3694.
  6. Bouguerra MS, Gainaru A, Gomez LB, Cappello F, Matsuoka S, Maruyama N. Improving the Computing Efficiency of HPC Systems Using a Combination of Proactive and Preventive Checkpointing. 2013 IEEE 27th International Symposium on Parallel and Distributed Processing. 2013. pp. 501–512.
  7. Tuli S, Casale G, Jennings NR. PreGAN: Preemptive Migration Prediction Network for Proactive Fault-Tolerant Edge Computing. IEEE INFOCOM 2022—IEEE Conference on Computer Communications. 2022. pp. 670–679.
  8. Frank A, Yang D, Brinkmann A, Schulz M, Süss T. Reducing False Node Failure Predictions in HPC. 2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC). 2019. pp. 323–332.
  9. Hu W, Jiang Y, Liu G, Dong W, Cai G. DDC: Distributed Data Collection Framework for Failure Prediction in Tianhe Supercomputers. In: Chen Y, Ienne P, Ji Q, editors. Advanced Parallel Processing Technologies. Cham: Springer International Publishing; 2015. pp. 18–32.
  10. Ebert C, Gallardo G, Hernantes J, Serrano N. DevOps. IEEE Software. 2016;33: 94–100.
  11. Zhu L, Bass L, Champlin-Scharff G. DevOps and Its Practices. IEEE Software. 2016;33: 32–34.
  12. Dang Y, Lin Q, Huang P. AIOps: Real-World Challenges and Research Innovations. 2019 IEEE/ACM 41st International Conference on Software Engineering: Companion Proceedings (ICSE-Companion). 2019. pp. 4–5.
  13. Masood A, Hashmi A. AIOps: Predictive Analytics & Machine Learning in Operations. Cognitive Computing Recipes. 2019; 359–382.
  14. AIOps: Predictive Analytics & Machine Learning in Operations | SpringerLink. [cited 16 Sep 2022]. https://link.springer.com/chapter/10.1007/978-1-4842-4106-6_7.
  15. Wang W, Yang X, Yang C, Guo X, Zhang X, Wu C. Dependency-based long short term memory network for drug-drug interaction extraction. BMC Bioinformatics. 2017;18: 578. pmid:29297301
  16. Gainaru A, Cappello F, Kramer W. Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems. 2012 IEEE 26th International Parallel and Distributed Processing Symposium. 2012. pp. 1168–1179.
  17. Zhong J. Study on Adaptive Failure Prediction Algorithm for Supercomputer. J Inf Comput Sci. 2015;12: 3697–3704.
  18. Jauk D, Yang D, Schulz M. Predicting faults in high performance computing systems: an in-depth survey of the state-of-the-practice. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. New York, NY, USA: Association for Computing Machinery; 2019. pp. 1–13.
  19. Shetty J, Sajjan R, G. S. Task Resource Usage Analysis and Failure Prediction in Cloud. 2019 9th International Conference on Cloud Computing, Data Science & Engineering (Confluence). 2019. pp. 342–348.
  20. Gainaru A, Cappello F, Snir M, Kramer W. Fault prediction under the microscope: A closer look into HPC systems. SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. 2012. pp. 1–11.
  21. Office FE. Online Failure Prediction in Cloud Datacenters. FUJITSU Sci Tech J. 2014;50.
  22. Bhanage DA, Pawar AV, Kotecha K. IT Infrastructure Anomaly Detection and Failure Handling: A Systematic Literature Review Focusing on Datasets, Log Preprocessing, Machine & Deep Learning Approaches and Automated Tool. IEEE Access. 2021;9: 156392–156421.
  23. Ju J, Liu F-A. Multivariate Time Series Data Prediction Based on ATT-LSTM Network. Applied Sciences. 2021;11: 9373.
  24. Chen X, Lu C-D, Pattabiraman K. Failure Analysis of Jobs in Compute Clouds: A Google Cluster Case Study. 2014 IEEE 25th International Symposium on Software Reliability Engineering. 2014. pp. 167–177.
  25. Zhu B, Wang G, Liu X, Hu D, Lin S, Ma J. Proactive drive failure prediction for large scale storage systems. 2013 IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST). Long Beach, CA, USA: IEEE; 2013. pp. 1–5.
  26. Nie B, Xue J, Gupta S, Engelmann C, Smirni E, Tiwari D. Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities. 2017 IEEE 25th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS). 2017. pp. 22–31.
  27. Srinivasu PN, SivaSai JG, Ijaz MF, Bhoi AK, Kim W, Kang JJ. Classification of Skin Disease Using Deep Learning Neural Networks with MobileNet V2 and LSTM. Sensors. 2021;21: 2852. pmid:33919583
  28. Islam T, Manivannan D. Predicting Application Failure in Cloud: A Machine Learning Approach. 2017 IEEE International Conference on Cognitive Computing (ICCC). 2017. pp. 24–31.
  29. McInnes L, Healy J, Astels S. hdbscan: Hierarchical density based clustering. JOSS. 2017;2: 205.
  30. Behera M, Sarangi A, Mishra D, Mallick PK, Shafi J, Srinivasu PN, et al. Automatic Data Clustering by Hybrid Enhanced Firefly and Particle Swarm Optimization Algorithms. Mathematics. 2022;10: 1–29.
  31. Khan K, Rehman SU, Aziz K, Fong S, Sarasvady S. DBSCAN: Past, present and future. The Fifth International Conference on the Applications of Digital Information and Web Technologies (ICADIWT 2014). 2014. pp. 232–238.
  32. Gowanlock M. Hybrid CPU/GPU clustering in shared memory on the billion point scale. Proceedings of the ACM International Conference on Supercomputing. Phoenix, Arizona: ACM; 2019. pp. 35–45.
  33. Greff K, Srivastava RK, Koutník J, Steunebrink BR, Schmidhuber J. LSTM: A Search Space Odyssey. IEEE Transactions on Neural Networks and Learning Systems. 2017;28: 2222–2232. pmid:27411231
  34. An Q, Tao Z, Xu X, El Mansori M, Chen M. A data-driven model for milling tool remaining useful life prediction with convolutional and stacked LSTM network. Measurement. 2020;154: 107461.
  35. Staudemeyer RC, Morris ER. Understanding LSTM—a tutorial into Long Short-Term Memory Recurrent Neural Networks. arXiv; 2019 Sep. Report No.: arXiv:1909.09586.
  36. Duan S, Zhao H. Attention Is All You Need for Chinese Word Segmentation. arXiv; 2020 Oct. Report No.: arXiv:1910.14537.
  37. https://github.com/YMyyds/Shanxi-Supercomputing-Center-Fault-Data1.
  38. Wang J, Li J, Wang X, Wang T, Sun Q. An air quality prediction model based on CNN-BiNLSTM-attention. Environ Dev Sustain. 2022 [cited 16 Sep 2022].
  39. Townsend JT. Theoretical analysis of an alphabetic confusion matrix. Perception & Psychophysics. 1971;9: 40–50.