
Application of multivariate time-series model for high performance computing (HPC) fault prediction

  • Xiangdong Pei,

    Roles Formal analysis, Methodology, Software, Writing – original draft

    Affiliations College of Computer, National University of Defense Technology, Changsha, China, Shanxi Supercomputing Center, Lvliang, China

  • Min Yuan,

    Roles Data curation

    Affiliation Shanxi Supercomputing Center, Lvliang, China

  • Guo Mao,

    Roles Writing – review & editing

    Affiliation College of Computer, National University of Defense Technology, Changsha, China

  • Zhengbin Pang

    Roles Project administration, Software

    zbpang@nudt.edu.cn

    Affiliation College of Computer, National University of Defense Technology, Changsha, China

Abstract

Aiming at the high reliability demands of increasingly large and complex supercomputing systems, this paper proposes a multidimensional fusion CBA-net (CNN-BiLSTM-Attention) fault prediction model that preprocesses and classifies data with HDBSCAN clustering. The model effectively extracts, learns, and fuses the spatial and temporal features in the predecessor fault logs; it is highly sensitive to time-series features and extracts local features thoroughly. Experiments show that the RMSE of the model for fault occurrence time prediction is 0.031, and the prediction accuracy of the node location of fault occurrence is 93% on average. The model achieves fast convergence and improves the fine-grained, accurate fault prediction of large supercomputers.

1. Introduction

In recent years, owing to the increasing demand for high-performance computing (HPC) and the scaling up of supercomputers and intelligent computing systems, the reliability of large-scale computing systems has been investigated extensively [1–4]. System operation is complex, and failures occur frequently and are difficult to detect, locate, diagnose, analyze, and debug [1,5,6]. Existing system health monitoring and checking techniques generally handle faults through different log sources, e.g., for root cause diagnosis and fault detection. However, they still lack the means to proactively handle faults in increasingly complex large-scale supercomputer systems. First, the complexity of supercomputer systems is determined by their novel architectures, continuously updated designs, constantly upgraded applications, and flexible logging mechanisms. Existing fault self-diagnosis techniques are inadequate to cope with these complex changes [6–9]. Owing to the increasing application of artificial intelligence and big data and the rapid development of computing hardware and applications, the operation and maintenance approach has evolved from DevOps (Development Operations) [10,11] to AIOps (Artificial Intelligence for IT Operations) [12–14]. The intelligent operation and maintenance approach can be combined with big data, machine learning, and other technologies to support the operational functions of IT equipment through proactive, personalized, and dynamic insights. AIOps platforms support the simultaneous use of multiple data sources, data acquisition analytics (real-time and deep), and representation technologies. Intelligent operation and maintenance algorithms are emerging technologies that integrate deep learning, time-series data, anomaly detection, and root cause localization in multiple dimensions.

The physical architecture of a supercomputer typically contains a log collection service system where system logs are collected in real-time for feedback. It is a system that allows administrators to understand the system status and fault events whenever necessary. Fault data includes multidimensional attributes of fault events, whereas various attribute elements are highly correlated and are primarily categorized into temporal and spatial correlations.

1.1 Time series-based representation

Temporal correlation in supercomputers refers to the following two aspects: first, specific faults can cause multiple faults on multiple nodes in a short period; second, the same fault can occur multiple times on a node before the root cause is identified and resolved. Spatial autocorrelation refers to the potential interdependence between the observations of variables within the same distribution. System failure prediction aims to predict possible failures that may occur during operation based on the current system state. The fault-prediction task is illustrated in Fig 1.

Time series-based representation: at the current moment t, possible failures are predicted in advance from the observed system state by monitoring the system over a data window of length Δtd. The advance is called the lead time Δtl. The length Δtp represents the validity of the prediction, also known as the prediction period. Increasing Δtp increases the probability that a failure is correctly predicted. Δtw is the minimum warning period, i.e., the minimum time needed to take preventive measures. If the lead time Δtl is shorter than the minimum warning period Δtw, preventive measures cannot be taken in time.
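The timing constraint above (a prediction is actionable only when Δtl ≥ Δtw) can be sketched as a small predicate; the class name and the example values are illustrative, not from the paper:

```python
from dataclasses import dataclass

@dataclass
class PredictionWindow:
    """Timing parameters of one failure prediction (all in seconds).
    Names mirror the text: lead_time (dt_l), warning_period (dt_w),
    prediction_period (dt_p); the values used below are illustrative."""
    lead_time: float          # dt_l: how far ahead the failure is predicted
    warning_period: float     # dt_w: minimum time needed for preventive action
    prediction_period: float  # dt_p: interval in which the prediction is valid

    def actionable(self) -> bool:
        # Preventive measures can only be taken when dt_l >= dt_w.
        return self.lead_time >= self.warning_period

# A 2-minute lead on a 1-minute warning period leaves time to react.
p = PredictionWindow(lead_time=120.0, warning_period=60.0, prediction_period=300.0)
```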

1.2 Spatial feature-based representation

Spatial correlation has two characteristics. First, inevitable failures can occur (almost) synchronously in the same subsystem or in multiple nodes at the boundary of the subsystem, such as failures in high-speed interconnects and file storage. Second, errors occurring in one node can trigger other errors in different nodes [15]. The research object of this work is a supercomputer deployed at the Shanxi Provincial Supercomputing Center, which has the following characteristics: each computer rack contains four computing frames, where each frame contains 32 computing nodes, a switching board, and a display board, all connected through a backplane. The system comprises 16 computer racks with 2048 computing nodes in total. The supercomputer is deployed with a visualization system to view failures in real time, as shown in Fig 2(a), where the red highlighted region provides real-time warnings about excessively high temperatures and memory overflow. Thus, the research object of this work exhibits both temporal and spatial attributes (Fig 2(b)), and for the convenience of vectorized data processing, it can be abstracted as data cubes. The reliability of computational systems has hitherto been improved to a certain extent [11,12]. However, fault prediction methods based on conventional machine learning and neural networks are applicable only to a limited range of computational platforms, and their localization and prediction accuracy require further improvement. The contributions of this work are as follows:

  1. A fault log preprocessing mechanism based on HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) clustering is introduced to extract multivariate feature information from the low-dimensional space of fault logs.
  2. Based on intelligent operation and maintenance, a fault prediction model of multivariate time-series (CNN–BiLSTM–attention) is proposed, which processes classification data based on HDBSCAN clustering and affords rapid convergence to improve prediction accuracy.
  3. The proposed multidimensional model can effectively extract and fuse spatial and temporal features in fault logs, and it is highly sensitive to time-series features. Besides, it can extract local features more effectively than conventional machine learning methods. Experiments show that the model yields more effective fault prediction of large computing systems to support decision-making and system management.
Fig 2. Supercomputer system fault monitoring view and time- and space-based fault log data cubes.

https://doi.org/10.1371/journal.pone.0281519.g002

The rest of the article is organized as follows. Section 2 describes our motivation for conducting this study. Section 3 reviews the related research work. Section 4 provides a detailed description of the overall approach of the multivariate time-series model. Section 5 describes the experimental procedure. The conclusion and discussion are presented in Section 6.

2. Motivation

With the increasing scale of supercomputer systems, the types of faults in supercomputers are becoming more complex. Traditional unitary fault-tolerance strategies, such as system checkpointing, are difficult to adapt to complex system failures [8]. Identifying the intrinsic fault-association characteristics of the system through statistical laws can support fault prediction and lightweight preprocessing, which is a key way to achieve active fault tolerance in supercomputers [16]. To discover the failure laws of major computing components from fault log data, and to address the quantitative description of their failure times [6], the failure data of supercomputers are analyzed along the time and space dimensions, and a multidimensional unified failure time model adapted to supercomputers is established. Through the synergistic analysis of applications and failures, the impact of different applications on system failures is discovered, which supports the development of targeted fault-tolerance strategies. Existing research has improved the reliability of computing systems to a certain extent; however, fault prediction methods based on traditional machine learning and neural networks target only specific classes of computing platforms, and their localization and prediction accuracy need further improvement. In carrying out fault prediction for large-scale supercomputing systems, we introduce a fault log preprocessing mechanism based on HDBSCAN clustering and first extract multivariate feature information from the low-dimensional space of fault logs. We then construct a multidimensional fusion network prediction model that effectively learns and fuses the spatial and temporal features in fault logs; compared with traditional machine learning methods, it is highly sensitive to time-series features and extracts local features adequately.

3. Related studies

Fault prediction is vital to supercomputing system reliability research. Therefore, the fault tolerance of supercomputing systems has been investigated extensively [7,17]. Present research primarily focuses on identifying fault sources and developing the corresponding prediction mechanisms [18]. Das et al. proposed a machine learning method that uses long short-term memory networks to predict node failures with a three-minute lead time, 85% recall, and 83% accuracy [1]. Frank et al. combined multiple, independently trained neural networks using different lead-up time offsets with simple majority voting, where a consensus among the neural networks is required to issue a positive (failure) final prediction [8]. Shetty et al. used the XGBoost classifier for failure class prediction based on task failure features on the Google cluster dataset and achieved high prediction accuracy [19]. Gainaru et al. proposed a signal-based fault prediction method that identifies regular events in system logs as signal data and employed algorithms to mine progressive association rules to calculate the temporal relationships between events; favorable results were ultimately obtained [20]. Fujitsu Laboratories developed a fault prediction technique based on message-pattern learning that creates and learns message patterns in real time, and evaluated its performance by obtaining messages online for experimental fault prediction in a real cloud data center; however, its prediction success rate needs improvement [21]. Moreover, these methods require overly complex feature extraction, and the models cannot easily be adapted to the scale of the system. Currently, machine learning is widely used to extract features from log data [22]. Ju et al. applied the attention mechanism to LSTM, enabling LSTM to screen multiple sequences, remove irrelevant redundant information, and capture information about interactions between sequences [23]. Chen et al. used an RNN to predict the probability of job failure from a task; despite their low accuracy, the prediction results enabled the conservation of system resources [24]. Zhu et al. employed support vector machine and neural network methods to predict hard-disk failures [25]. Nie et al. analyzed the correlation among temperature, power, and errors on GPUs, proposed a neural network-based prediction method, and predicted failures for four cabinets of the TITAN supercomputer with 82% accuracy [26]. Islam et al. proposed the use of LSTM for prediction [27], which was not completely accurate but facilitated the conservation of system resources [28]. Although these methods solve problems pertaining to feature extraction, they cannot reveal the dependencies between faults, and none of the prediction results is satisfactory. Table 1 summarizes related studies on traditional and deep learning methods for fault prediction in large-scale complex computing systems.

Table 1. Related research of traditional methods and deep learning methods in fault prediction of large-scale complex computing systems.

https://doi.org/10.1371/journal.pone.0281519.t001

4. Multivariate time-series model

In this section, the architecture integrating HDBSCAN, CNN, BiLSTM, and attention is explained. HDBSCAN is used to cluster data with different faults and to preprocess the fault data. The first layer of the CBA network model is the CNN layer, whose main role is to extract the local temporal and spatial features of the fault logs; BiLSTM maintains multivariate time-series fault features while predicting the next state; finally, the attention mechanism enhances the features with a high impact on the results to further improve the accuracy of fault prediction.

4.1 Data preprocessing based on HDBSCAN

Data clustering is a process of arranging similar data into groups based on certain characteristics and properties, where each group is considered a cluster [23]. Because supercomputing system failures have many causes, such as hardware, software, and operational failures, numerous categories of failure logs exist. For noisy log data, the present research primarily applies the HDBSCAN [29] algorithm to process the fault logs and combines data with similar characteristics to obtain more accurate prediction results [30].

The HDBSCAN clustering algorithm is an improved version of the density-based clustering algorithm DBSCAN [31]; it combines the DBSCAN algorithm with hierarchical clustering. The DBSCAN clustering algorithm yields better results than other clustering algorithms on anomalous datasets [32]. However, it can only cluster data with the same density distribution, and the clustering process requires tuning two parameters, Minpts (the minimum number of points) and Eps (the neighborhood radius), which restricts the use of the DBSCAN clustering algorithm. Hence, hierarchical clustering is introduced into the HDBSCAN clustering algorithm, where the distance between two points is redefined as

d_mreach-k(a, b) = max{core_k(a), core_k(b), d(a, b)},  (1)

where d_mreach-k(a, b) refers to the mutual reachability distance between two points a and b, core_k(·) is the core distance (the distance to a point's k-th nearest neighbor), and d(a, b) is the Euclidean distance between a and b. The clustering algorithm uses the minimum spanning tree to construct the hierarchical tree model between points, which implies that only the minimum cluster size (min_cluster_size) needs to be defined in the algorithm to obtain the optimal clustering results. Therefore, complicated tuning can be avoided, and the clustering accuracy and applicability can be improved. Alg 1 shows the pseudocode of HDBSCAN. The main steps of HDBSCAN are as follows: transforming the space → building the minimum spanning tree → building the cluster hierarchy → condensing the cluster tree → extracting the clusters.
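The first two steps, transforming distances via Eq (1) and building the minimum spanning tree, can be sketched with numpy and scipy; the toy points and the choice k = 2 are illustrative, not the paper's data or settings:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist

def mutual_reachability(X, k):
    """Mutual reachability distance matrix of Eq (1):
    d_mreach-k(a, b) = max{core_k(a), core_k(b), d(a, b)}."""
    d = cdist(X, X)                  # pairwise Euclidean distances
    # core_k(a): distance from a to its k-th nearest neighbour
    core = np.sort(d, axis=1)[:, k]  # column 0 is the point itself
    return np.maximum(d, np.maximum(core[:, None], core[None, :]))

# Toy fault-feature vectors: two well-separated groups.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
D = mutual_reachability(X, k=2)

# The minimum spanning tree over D is the backbone of HDBSCAN's hierarchy.
Dm = D.copy()
np.fill_diagonal(Dm, 0)              # drop self-loops before building the MST
mst = minimum_spanning_tree(Dm)
```

HDBSCAN then derives the cluster hierarchy by removing MST edges in decreasing weight order and condensing the resulting tree.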

Algorithm 1. HDBSCAN clustering pseudocode.

Input: Location data LD; parameters Eps and Minpts; S-Tree height Height

Output: LD with cluster labels; the Spatial_Tree is built

1. DBSCAN_OBJECT Root = Joint(LD, Eps, Minpts);       // root node of the tree
2. enqueue(Q, Root);                                  // push the DBSCAN object into the queue
3. front = 0; last = 0; level = 0;
4. while (Q <> empty and front <= last) do
5.   DBSCAN_OBJECT node = dequeue(Q);                 // pull data from the queue
6.   front++;
7.   Data_OBJECT Children = DBSCAN.getCluster(node);  // call DBSCAN
8.   if (level > Height)
9.     break;
10.  for i FROM 1 TO Children.size do
11.    Data child = Children.get(i);
12.    DBSCAN_OBJECT obj = Joint(child, Eps, Minpts);
13.    enqueue(Q, obj);
14.  end for
15.  if (front > last)                                // all members in one level have been searched
16.    last = Q.size() + front - 1;
17.    level++;
18.  end if
19. end while
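The level-order traversal of Algorithm 1 can be sketched in Python. As a minimal sketch, the `split` function below is a stand-in for the per-node DBSCAN call (`DBSCAN.getCluster`); the threshold rule and toy data are hypothetical:

```python
from collections import deque

def hierarchical_cluster(data, split, height):
    """BFS hierarchical clustering mirroring Algorithm 1.
    `split(points)` stands in for the DBSCAN call: it returns a list of
    sub-clusters (lists of points). Each point's label is its path from
    the root of the cluster tree."""
    labels = {}
    queue = deque([(tuple(data), ())])       # (points, path-in-tree)
    while queue:
        points, path = queue.popleft()
        if len(path) >= height:              # depth limit (Alg 1, line 8)
            continue
        for i, child in enumerate(split(list(points))):
            for p in child:
                labels[p] = path + (i,)      # label = path from the root
            if len(child) > 1:               # recurse only into splittable groups
                queue.append((tuple(child), path + (i,)))
    return labels

# Toy split rule standing in for DBSCAN: separate values below/above 10.
def split(points):
    low = [p for p in points if p < 10]
    high = [p for p in points if p >= 10]
    return [c for c in (low, high) if c]

labels = hierarchical_cluster([1, 2, 3, 11, 12], split, height=1)
```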

4.2 Methodology

4.2.1 Convolutional neural networks.

The first layer of the model is a CNN (convolutional neural network) layer, whose primary role is to extract the local features of the fault logs. Fig 3 shows the structure of the convolutional neural network.

The extraction of fault feature information from a time series by a 1D CNN is primarily performed by filters in the convolutional layer, which contain numerous kernels. Each kernel covers an acceptable field of log information, and each layer is passed through a rectified linear unit (ReLU) activation function:

f(x) = max(0, x).  (2)

After the activation function suppresses negative values and mitigates the gradient vanishing and gradient explosion problems, feature mapping is performed by the filter as

y_n^m = f(W_n^m * x + b_n^m),  (3)

where y_n^m is the output of the nth filter in convolutional layer m, f is the activation function, W_n^m is the weight of the convolutional kernel, b_n^m is the bias, * denotes convolution, and x is the input feature vector. After the convolutional layer, the features are reduced in dimension through the max-pooling layer to compress the data and decrease the number of parameters to prevent overfitting.
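The conv → ReLU → max-pool pipeline above can be sketched in plain numpy (the paper uses TensorFlow; the kernel, bias, and input sequence here are illustrative, not trained values):

```python
import numpy as np

def relu(x):
    # Eq (2): f(x) = max(0, x)
    return np.maximum(0.0, x)

def conv1d(x, w, b):
    """Valid 1-D convolution (ML convention, no kernel flip) of Eq (3):
    y = f(w * x + b). Weights here are illustrative."""
    n = len(x) - len(w) + 1
    y = np.array([np.dot(x[i:i + len(w)], w) + b for i in range(n)])
    return relu(y)

def max_pool(y, size=2):
    # Non-overlapping max pooling compresses the feature map.
    trimmed = y[: len(y) // size * size]
    return trimmed.reshape(-1, size).max(axis=1)

x = np.array([0.0, 1.0, -1.0, 2.0, 0.5, -0.5])  # toy log-feature sequence
w = np.array([1.0, -1.0])                       # toy difference kernel
out = max_pool(conv1d(x, w, b=0.0))
```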

4.2.2 BiLSTM prediction network.

Considering the time-dynamic nature of supercomputer systems, conventional supervised learning methods such as logistic regression, support vector machines, and tree-based classifiers treat input sequences as independent features and cannot capture the temporal dependence between them. In this study, recurrent neural networks (RNNs) were applied to the system to overcome the disadvantages of conventional learning methods. Nonetheless, classical RNNs lack the ability to store previous input information for a long duration, which weakens their ability to model the long-range structure of input sequences. LSTM (long short-term memory) is an RNN architecture that aims to improve the ability of RNNs to store and access information [33]. In this work, an LSTM-based prediction network was applied to model the dynamic properties of computer systems (Fig 4), based on the significant time dependence of fault prediction in the computational systems described above.

Since fault log information is time-based serial information, temporal characteristics are critical for predicting faults. LSTM is an improved version of the RNN model [33], which solves the problem of gradient explosion and gradient disappearance in the RNN model to a great extent [34]. LSTM introduces a set of storage units and allows historical information to be forgotten at a one-time node during the training and update of the storage units, thus, it is more conducive to processing information over longer distances and is beneficial for managing time-sensitive data [23]. The structure diagram of LSTM is shown in Fig 5.

As shown in Fig 5, the LSTM cell comprises four critical components: the internal memory, the forget gate, the input gate, and the output gate. First, the forget gate f_t determines how much of the previous cell state C_{t−1} is retained given the current input x_t. Subsequently, the input gate i_t, which determines the information to be retained from the input, is calculated, and the candidate cell state a_t is formed. Next, the current cell state C_t is calculated. Finally, the output gate o_t and the hidden state h_t are calculated:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
a_t = tanh(W_a · [h_{t−1}, x_t] + b_a)
C_t = f_t ⊙ C_{t−1} + i_t ⊙ a_t        (4)
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
h_t = o_t ⊙ tanh(C_t)

where σ denotes the sigmoid function and b_f, b_i, b_o, and b_a denote the biases. The constructed model utilizes a BiLSTM recurrent network layer, which allows time-series features to be learned in both the forward and backward directions and is more conducive to feature extraction [35].
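A single LSTM step implementing the gate equations in (4) can be sketched in numpy; the dimensions and random weights below are illustrative, not the trained model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    """One LSTM step per Eq (4). W maps the concatenated [h_prev, x_t]
    to the four gate pre-activations; weights here are illustrative."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ z + b["f"])  # forget gate
    i = sigmoid(W["i"] @ z + b["i"])  # input gate
    a = np.tanh(W["a"] @ z + b["a"])  # candidate cell state
    c = f * c_prev + i * a            # new cell state
    o = sigmoid(W["o"] @ z + b["o"])  # output gate
    h = o * np.tanh(c)                # new hidden state
    return h, c

# Toy dimensions: hidden size 2, input size 3.
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((2, 5)) * 0.1 for k in "fiao"}
b = {k: np.zeros(2) for k in "fiao"}
h, c = lstm_cell(rng.standard_normal(3), np.zeros(2), np.zeros(2), W, b)
```

A BiLSTM simply runs one such cell over the sequence forward and another backward, concatenating the two hidden states at each step.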

4.2.3 Attention mechanism.

The attention mechanism recognizes crucial information by enhancing focus [36]; it disregards unimportant information and concentrates on vital information. A structural model based on the attention mechanism can record the positional relationships between pieces of information and measure the importance of specific information features through information weights. Dynamic weight parameters are determined by distinguishing relevant from irrelevant information features to strengthen critical information and weaken ineffective information, thereby increasing the efficiency of the deep learning algorithm and mitigating some defects of conventional deep learning. Let K_t denote the output processed by the CNN and BiLSTM models. A score s_t is calculated from K_t to decide its level of influence on the output value. Subsequently, the softmax function normalizes s_t to obtain the attention weights a_t. Finally, the weight coefficients and the input vector K_t are used to calculate the weighted features:

s_t = tanh(W_h K_t + b_h)
a_t = softmax(s_t)        (5)
c = Σ_t a_t K_t

where W_h and b_h refer to the weight and bias, respectively.
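The scoring-then-weighting scheme of Eq (5) can be sketched as attention pooling over a toy BiLSTM output; the scoring weights are illustrative, not trained values:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())      # numerically stable softmax
    return e / e.sum()

def attention_pool(K, W_h, b_h):
    """Attention pooling per Eq (5): score each time step, normalize
    with softmax, and return the weighted sum of the features."""
    s = np.tanh(K @ W_h + b_h)   # one score per time step
    a = softmax(s)               # attention weights, sum to 1
    return a @ K, a              # context vector and the weights

T, d = 4, 3                      # toy sequence: 4 time steps, 3 features
rng = np.random.default_rng(1)
K = rng.standard_normal((T, d))  # stand-in for the BiLSTM output
context, weights = attention_pool(K, rng.standard_normal(d) * 0.1, 0.0)
```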

4.3 Model framework

The apparent features in the fault log data of the supercomputing system are first discovered by the CNN to extract fault features. Subsequently, the CNN output is fed into the BiLSTM, whose forward and backward passes extract fault features along the time series. Finally, the features with a more significant impact on the results are retrieved by the attention mechanism to enhance the accuracy of fault prediction. The specific structure of the model is illustrated in Fig 6.

After the data preprocessing and clustering operations, the non-numerical components of the data are encoded via the sklearn preprocessing utilities. Subsequently, the data are normalized and transformed into a supervised-learning format.
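The normalization and supervised-learning transform can be sketched in numpy; the window length and toy series below are illustrative, not the paper's configuration:

```python
import numpy as np

def min_max(series):
    # Min-max normalization to [0, 1], as commonly done before LSTM training.
    lo, hi = series.min(), series.max()
    return (series - lo) / (hi - lo)

def to_supervised(series, window):
    """Frame a time series for supervised learning: each input row holds
    `window` consecutive values and the target is the next value."""
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

s = min_max(np.array([3.0, 7.0, 5.0, 9.0, 11.0, 4.0]))  # toy interval series
X, y = to_supervised(s, window=2)
```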

5. Experiments

This section introduces the experiments in detail and is divided into three parts. The first part presents the dataset and the model evaluation indicators. The second part introduces the experimental parameter settings of the comparison experiments. The third part describes the experimental results, visually displaying the prediction results of the multivariate time-series model and the comparison with the baseline models.

5.1 Dataset description and evaluation indicators

5.1.1 Dataset description.

The fault logs of the Shanxi Supercomputing Center collected from 2016 to 2018, comprising 8,718,121 entries, were employed as the experimental data [37]. Each system log record contains 26 fields, of which 16 are NULL and are removed as invalid. The following ten fields are retained: the record ID; the time at which the log recorded the fault, ReceivedAt; the time at which the failure first occurred, DeviceReportedTime; the failing device name, Facility; the failure level, Priority; the failing node number, FromHost; the failure message, Message; the failure number, InfoUnitID; the failure log tag, SysLogTag; and the check code, checksum.

Since ReceivedAt is the time recorded after the fault is "sensed" by the logging system, it cannot be used as the actual time at which the fault occurred. Therefore, DeviceReportedTime is recorded as the occurrence time of the fault and converted to date form, whereas the ReceivedAt and ID fields are deleted. Because the time of failure is uncertain, predicting the time of failure can be regarded as predicting the lead time of failure; that is, the interval between adjacent failures is calculated as the time interval, and failure log information with nine fields is obtained. The specific log information is presented in Fig 7.
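Computing inter-failure intervals from occurrence timestamps can be sketched with the standard library; the timestamps below are illustrative, not real log entries:

```python
from datetime import datetime

def failure_intervals(timestamps):
    """Turn a list of DeviceReportedTime-style timestamps into the
    intervals (in seconds) between consecutive failures, which serve
    as the prediction target described above."""
    ts = sorted(datetime.fromisoformat(t) for t in timestamps)
    return [(b - a).total_seconds() for a, b in zip(ts, ts[1:])]

intervals = failure_intervals([
    "2017-03-01 10:00:00",
    "2017-03-01 10:00:45",
    "2017-03-01 10:12:00",
])
```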

The fault data were first analyzed in general, and the number of failures of each node was counted; the results are shown in Fig 8a. Regarding the spatial distribution of system failures, each computer rack contains four computing frames, where each frame contains 32 computing nodes connected through the backplane. Based on this spatial relationship, the spatial probability density diagram of the frame in which each failed node is located can be obtained, as shown in Fig 8b. The results show that the first 15 frames present higher risks of failure, which is related to the intensity of their tasks. Because failures occur at uncertain times, only the interval between two failures could be predicted. The characteristics of the failure time distribution were obtained via analysis, and the results are shown in Fig 8c. Most of the failure intervals were short, indicating that the same failure might have occurred frequently.

Fig 8. Calculation of number of node failures and their spatial and temporal probability densities.

(a) Number of failures per node; (b) spatial probability map of failures; (c) temporal probability map of faults.

https://doi.org/10.1371/journal.pone.0281519.g008

5.1.2 Evaluation indicators.

The two objectives of predicting the fault occurrence time and the fault occurrence node were assessed using the mean absolute error (MAE) and the root mean square error (RMSE) as statistical performance metrics [38]:

MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|  (6)

RMSE = sqrt((1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²)  (7)

where y_i is the observed value and ŷ_i is the predicted value.
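The two metrics of Eqs (6) and (7) can be computed directly in numpy; the toy predictions below are for illustration only:

```python
import numpy as np

def mae(y_true, y_pred):
    # Eq (6): mean absolute error
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    # Eq (7): root mean square error
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true = np.array([1.0, 2.0, 3.0, 4.0])  # toy observed values
y_pred = np.array([1.1, 1.9, 3.2, 3.8])  # toy predicted values
err_mae, err_rmse = mae(y_true, y_pred), rmse(y_true, y_pred)
```

RMSE penalizes large deviations more heavily than MAE, so RMSE ≥ MAE always holds.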

The confusion matrix [39], also called the error matrix, is a standard format for representing accuracy evaluation in the form of an n-by-n matrix. The four combinations of true and predicted values are: true positive (TP), where the true category of the sample is positive and the model predicts it to be positive; true negative (TN), where the true category is negative and the model predicts it to be negative; false positive (FP), where the true category is negative but the model predicts it to be positive; and false negative (FN), where the true category is positive but the model predicts it to be negative [8].

Accuracy = (TP + TN) / (TP + TN + FP + FN)  (8)
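The confusion-matrix tally and the accuracy of Eq (8) can be sketched as follows; the binary labels (1 = failure, 0 = normal) are illustrative:

```python
def confusion_counts(y_true, y_pred):
    """Tally TP/TN/FP/FN for binary labels (1 = failure, 0 = normal)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    # Eq (8): fraction of correctly classified samples
    return (tp + tn) / (tp + tn + fp + fn)

tp, tn, fp, fn = confusion_counts([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```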

5.2 Parameter configuration

For the experiments, an Intel(R) Core(TM) i7-10750H CPU @ 2.60 GHz (12 logical CPUs) with 16 GB of RAM, Windows 11 64-bit, and an NVIDIA GeForce RTX 3060 were used, together with Python 3.9, TensorFlow 2.6, and scikit-learn. The deep learning model was trained with Adam as the optimizer, the MAE as the loss function, and 50 epochs. Of the data, 80% and 20% were allocated as training and prediction data, respectively. The same environment was used for all models. The model constructed in this paper predicts both fault time and location; the prediction model parameters are set as shown in Table 2, the batch size is 72 (i.e., 72 samples are input to the model at a time), and the activation function is tanh. The training loss for time prediction reaches below 0.001, and the training loss for node prediction reaches 0.01.

Table 2. Hyperparameter settings for fault node location prediction model.

https://doi.org/10.1371/journal.pone.0281519.t002

5.3 Experimental results and analysis

5.3.1 Clustering results.

HDBSCAN clustering was performed on the preprocessed fault logs, and the clustering results revealed five types of fault characteristics, namely Cluster0, Cluster1, Cluster2, Cluster3, and Cluster4; their distribution is shown in Fig 9a. The distribution chart shows that Cluster2, Cluster0, and Cluster4 constitute 37.45%, 20.04%, and 10.99%, respectively. Based on the priority of fault occurrence, faults can be classified into six levels, Priority0 through Priority5, where Priority0, Priority1, and Priority2 each contain fewer than 2000 messages (49, 1004, and 1161 messages, respectively). Because of their low message counts, these fault priorities were uniformly classified as "others", and the six priorities were divided into the following four levels: "other", Priority3, Priority4, and Priority5. As shown in Fig 9b, the higher the priority, the higher the occurrence probability and the lower the failure severity. Based on the analysis of each clustering category, as shown in Fig 9c, Cluster0 primarily contains the fault priorities "other", Priority3, and Priority4, indicating that the data in Cluster0 are relatively severe faults. The faults in Cluster1, Cluster2, and Cluster3 are of lower severity owing to the Priority5 data: most of the data in Cluster1 and Cluster3 are Priority5 data, whose degree of failure is the lowest, whereas Cluster2 also contains some Priority3 failures. The failure data of Cluster4 are more complex than those of the other clusters, with all distribution degrees represented fairly evenly; however, Priority4 has the highest share, indicating that the data in Cluster4 exhibit an intermediate degree of failure. As shown in Fig 9d, the fault logs for Facility1 and Facility6 contain only 51 and 54 messages, respectively, indicating that these two devices do not fail frequently. Among Facility0, Facility2, Facility3, Facility4, and Facility5, Facility5 is the most prone to failure. In addition, the present research compared the locations of the failed devices in each cluster. As shown in Fig 9e, the failures of Cluster0 occurred primarily in Facility0, whereas those of Cluster1 and Cluster3 occurred primarily in Facility3 and Facility4, respectively. Meanwhile, the failures of Cluster2 occurred in Facility4. Among the clusters, Cluster4 was more complicated, and its fault locations were randomly distributed.

Fig 9. HDBSCAN clustering result map.

(a) HDBSCAN clustering result graph, (b) Fault priority distribution diagram, (c) Fault priority distribution diagram of each cluster, (d) Data fault device distribution diagram, and (e) Fault device distribution of each cluster.

https://doi.org/10.1371/journal.pone.0281519.g009

In summary, the fault data were preprocessed by HDBSCAN to categorize the fault category, severity, occurrence location, and susceptibility factors, enabling more accurate future predictions.

5.3.2 Fault time prediction.

The predictions of fault time from the overall data and from the data of each cluster are shown in Fig 10. The clustered data (b, c, d, e, and f) yielded a more significant effect than the overall data (a) in predicting fault time. When the model was trained on the clustered preprocessed data, the prediction results fitted the training data more closely.

Fig 10. Fault node time prediction training/validation loss.

Training/validation losses of (a) all data, (b) Cluster0, (c) Cluster1, (d) Cluster2, (e) Cluster3, and (f) Cluster4.

https://doi.org/10.1371/journal.pone.0281519.g010

The model's prediction results for the comprehensive data and for each cluster category are presented in Table 3. The MAE values for Cluster3 and Cluster0 were 0.011 and 0.249, respectively, and their RMSE values were 0.135 and 2.199. The variation in the MAE and RMSE values is positively correlated with the complexity of each cluster's data composition, indicating that the model possesses good generalization and prediction abilities.
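For reference, the MAE and RMSE metrics reported in Table 3 can be computed directly from predicted and true values. The toy targets below are invented for illustration and do not come from the paper's dataset:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the prediction error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more than MAE."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical normalized fault-time targets vs. model outputs
y_true = np.array([0.10, 0.25, 0.40, 0.55])
y_pred = np.array([0.12, 0.24, 0.43, 0.51])
errors = (mae(y_true, y_pred), rmse(y_true, y_pred))
```

Because RMSE squares each residual before averaging, a cluster whose data composition is more heterogeneous (and thus produces occasional large errors) shows a disproportionately larger RMSE than MAE, which matches the spread seen between Cluster3 and Cluster0 in Table 3.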

Table 3. Evaluation metrics of model for various types of downtime prediction.

https://doi.org/10.1371/journal.pone.0281519.t003

5.3.3 Fault location prediction.

To predict the location of faulty nodes in a system comprising 2048 nodes, the precise ID and location of each node must be determined. The model's predictions of faulty-node locations are illustrated in Fig 11; as in the fault-time experiments, comparisons were run on both the overall data and the data of each cluster. Fig 11 shows that the predictions obtained from the clustered data locate faulty nodes more accurately. As indicated in Table 3, clustering with HDBSCAN yielded better predictions: the MAE values of Cluster2 and Cluster4 were 1.49 and 6.60, respectively, and the RMSE values of Cluster2 and Cluster0 were 1.49 and 9.55, respectively. The changes in the MAE and RMSE values were positively correlated with the data-composition complexity of the clusters, showing that the model generalizes well when predicting the location of faulty nodes.

Fig 11. Fault node location prediction training/validation loss.

Training/validation losses for (a) all data, (b) Cluster0, (c) Cluster1, (d) Cluster2, (e) Cluster3, (f) and Cluster4.

https://doi.org/10.1371/journal.pone.0281519.g011

5.3.4 Comparison of models.

To evaluate the predictive power of the multidimensional time-series model, experiments were conducted using data collected from the complex Cluster1 to assess the accuracy of the model's faulty-node-location predictions. Training used a batch size of 256 and 50 epochs, with all other variables held constant. The experimental results are shown in Table 4: the prediction accuracy of the proposed model outperforms SVR, XGBoost, LSTM, and other methods. This is attributed to the HDBSCAN clustering preprocessing and the fusion mechanism of the CBA network model, in which the CNN-BiLSTM component mines the temporal and spatial features in the fault logs while the attention mechanism efficiently weights the most informative features.
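A minimal sketch of such a CNN-BiLSTM-Attention fusion is shown below, written in PyTorch with hypothetical layer sizes (the section does not publish the exact architecture, so the channel/hidden dimensions and the `CBANet` name are assumptions for illustration):

```python
import torch
import torch.nn as nn

class CBANet(nn.Module):
    """Sketch of a CNN-BiLSTM-Attention regressor (hypothetical sizes)."""
    def __init__(self, n_features=8, conv_channels=32, lstm_hidden=64):
        super().__init__()
        # CNN: extracts local patterns along the time axis
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, conv_channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # BiLSTM: models long-range temporal dependencies in both directions
        self.bilstm = nn.LSTM(conv_channels, lstm_hidden,
                              batch_first=True, bidirectional=True)
        # Attention: scores each time step, then forms a weighted summary
        self.attn = nn.Linear(2 * lstm_hidden, 1)
        self.head = nn.Linear(2 * lstm_hidden, 1)

    def forward(self, x):                      # x: (batch, time, features)
        h = self.conv(x.transpose(1, 2))       # -> (batch, channels, time)
        h, _ = self.bilstm(h.transpose(1, 2))  # -> (batch, time, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)   # attention weights over time
        ctx = (w * h).sum(dim=1)               # context vector per sample
        return self.head(ctx).squeeze(-1)      # one scalar prediction each

model = CBANet()
y = model(torch.randn(4, 16, 8))  # 4 windows of 16 steps x 8 features
```

The fusion order mirrors the description above: convolution captures local log-feature patterns, the bidirectional LSTM propagates them across the time window, and the attention weights let the final regression head focus on the most informative time steps.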

Table 4. Performance of CNN–BiLSTM–attention model in comparison with those of other models.

https://doi.org/10.1371/journal.pone.0281519.t004

Table 5 compares the performance of the CNN–BiLSTM–attention model with those of other models on the fault log data. Our proposed model is compared with five fault prediction models, and two ablation experiments are conducted; the results show that the proposed multidimensional time-series model achieves finer granularity (time and location) and higher prediction accuracy in fault prediction for supercomputing systems.

Table 5. Evaluation metrics for model prediction of various fault node types.

https://doi.org/10.1371/journal.pone.0281519.t005

6. Results and discussion

In this paper, we propose a data preprocessing method based on HDBSCAN clustering to classify faults, and then build a multidimensional CNN-BiLSTM-Attention network model to train on the preprocessed data. The multidimensional model can effectively extract and fuse the spatial and temporal features in fault logs; compared with traditional machine learning methods, it is highly sensitive to time-series features and extracts local features thoroughly. The average prediction accuracy exceeds 93%. Although the proposed method can serve as a reference for reliability research on supercomputing and intelligent computing systems, and good experimental results have been achieved in practical prediction, the prediction system is based on historical data: it responds insufficiently to real-time fault data and incurs large computational and bandwidth overheads. Meanwhile, the preprocessed fault data can be used not only for fault analysis and prediction but also for fault-tolerant recovery of the system. In future research, we will first improve the speed of data acquisition and preprocessing, optimize the fault analysis and prediction mechanism, and apply that mechanism to fault-tolerant recovery of the system; the granularity and accuracy of fault prediction classification will be further improved to reduce the growing node computation and network overhead incurred while the prediction model runs. Second, the scope of prediction can be extended to energy efficiency, a challenge that matters to supercomputing providers seeking to minimize costs. In addition, the application of transfer learning techniques can be explored to provide a useful reference for fault-tolerant frameworks for supercomputing systems.

References

  1. Das A, Mueller F, Siegel C, Vishnu A. Desh: deep learning for system health prediction of lead times to failure in HPC. Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing. New York, NY, USA: Association for Computing Machinery; 2018. pp. 40–51.
  2. Roman E, Das A, Mueller F, Hargrove PH. Pin-pointing Node Failures in HPC Systems. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); 2020 Mar. https://www.osti.gov/biblio/1605274.
  3. Molan M, Borghesi A, Beneventi F, Guarrasi M, Bartolini A. An Explainable Model for Fault Detection in HPC Systems. In: Jagode H, Anzt H, Ltaief H, Luszczek P, editors. High Performance Computing. Cham: Springer International Publishing; 2021. pp. 378–391.
  4. Mao G, Zeng R, Peng J, Zuo K, Pang Z, Liu J. Reconstructing gene regulatory networks of biological function using differential equations of multilayer perceptrons. BMC Bioinformatics. 2022;23: 503. pmid:36434499
  5. Zhu L, Gu J, Wang Y, Zhao T, Cai Z. Optimizing the fault-tolerance overheads of HPC systems using prediction and multiple proactive actions. J Supercomput. 2015;71: 3668–3694.
  6. Bouguerra MS, Gainaru A, Gomez LB, Cappello F, Matsuoka S, Maruyama N. Improving the Computing Efficiency of HPC Systems Using a Combination of Proactive and Preventive Checkpointing. 2013 IEEE 27th International Symposium on Parallel and Distributed Processing. 2013. pp. 501–512.
  7. Tuli S, Casale G, Jennings NR. PreGAN: Preemptive Migration Prediction Network for Proactive Fault-Tolerant Edge Computing. IEEE INFOCOM 2022—IEEE Conference on Computer Communications. 2022. pp. 670–679.
  8. Frank A, Yang D, Brinkmann A, Schulz M, Süss T. Reducing False Node Failure Predictions in HPC. 2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC). 2019. pp. 323–332.
  9. Hu W, Jiang Y, Liu G, Dong W, Cai G. DDC: Distributed Data Collection Framework for Failure Prediction in Tianhe Supercomputers. In: Chen Y, Ienne P, Ji Q, editors. Advanced Parallel Processing Technologies. Cham: Springer International Publishing; 2015. pp. 18–32.
  10. Ebert C, Gallardo G, Hernantes J, Serrano N. DevOps. IEEE Software. 2016;33: 94–100.
  11. Zhu L, Bass L, Champlin-Scharff G. DevOps and Its Practices. IEEE Software. 2016;33: 32–34.
  12. Dang Y, Lin Q, Huang P. AIOps: Real-World Challenges and Research Innovations. 2019 IEEE/ACM 41st International Conference on Software Engineering: Companion Proceedings (ICSE-Companion). 2019. pp. 4–5.
  13. Masood A, Hashmi A. AIOps: Predictive Analytics & Machine Learning in Operations. Cognitive Computing Recipes. 2019; 359–382.
  14. AIOps: Predictive Analytics & Machine Learning in Operations | SpringerLink. [cited 16 Sep 2022]. https://link.springer.com/chapter/10.1007/978-1-4842-4106-6_7.
  15. Wang W, Yang X, Yang C, Guo X, Zhang X, Wu C. Dependency-based long short term memory network for drug-drug interaction extraction. BMC Bioinformatics. 2017;18: 578. pmid:29297301
  16. Gainaru A, Cappello F, Kramer W. Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems. 2012 IEEE 26th International Parallel and Distributed Processing Symposium. 2012. pp. 1168–1179.
  17. Zhong J. Study on Adaptive Failure Prediction Algorithm for Supercomputer. J Inf Comput Sci. 2015;12: 3697–3704.
  18. Jauk D, Yang D, Schulz M. Predicting faults in high performance computing systems: an in-depth survey of the state-of-the-practice. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. New York, NY, USA: Association for Computing Machinery; 2019. pp. 1–13.
  19. Shetty J, Sajjan R, G. S. Task Resource Usage Analysis and Failure Prediction in Cloud. 2019 9th International Conference on Cloud Computing, Data Science & Engineering (Confluence). 2019. pp. 342–348.
  20. Gainaru A, Cappello F, Snir M, Kramer W. Fault prediction under the microscope: A closer look into HPC systems. SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. 2012. pp. 1–11.
  21. Office FE. Online Failure Prediction in Cloud Datacenters. FUJITSU Sci Tech J. 2014;50.
  22. Bhanage DA, Pawar AV, Kotecha K. IT Infrastructure Anomaly Detection and Failure Handling: A Systematic Literature Review Focusing on Datasets, Log Preprocessing, Machine & Deep Learning Approaches and Automated Tool. IEEE Access. 2021;9: 156392–156421.
  23. Ju J, Liu F-A. Multivariate Time Series Data Prediction Based on ATT-LSTM Network. Applied Sciences. 2021;11: 9373.
  24. Chen X, Lu C-D, Pattabiraman K. Failure Analysis of Jobs in Compute Clouds: A Google Cluster Case Study. 2014 IEEE 25th International Symposium on Software Reliability Engineering. 2014. pp. 167–177.
  25. Zhu B, Wang G, Liu X, Hu D, Lin S, Ma J. Proactive drive failure prediction for large scale storage systems. 2013 IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST). Long Beach, CA, USA: IEEE; 2013. pp. 1–5.
  26. Nie B, Xue J, Gupta S, Engelmann C, Smirni E, Tiwari D. Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities. 2017 IEEE 25th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS). 2017. pp. 22–31.
  27. Srinivasu PN, SivaSai JG, Ijaz MF, Bhoi AK, Kim W, Kang JJ. Classification of Skin Disease Using Deep Learning Neural Networks with MobileNet V2 and LSTM. Sensors. 2021;21: 2852. pmid:33919583
  28. Islam T, Manivannan D. Predicting Application Failure in Cloud: A Machine Learning Approach. 2017 IEEE International Conference on Cognitive Computing (ICCC). 2017. pp. 24–31.
  29. McInnes L, Healy J, Astels S. hdbscan: Hierarchical density based clustering. JOSS. 2017;2: 205.
  30. Behera M, Sarangi A, Mishra D, Mallick PK, Shafi J, Srinivasu PN, et al. Automatic Data Clustering by Hybrid Enhanced Firefly and Particle Swarm Optimization Algorithms. Mathematics. 2022;10: 1–29.
  31. Khan K, Rehman SU, Aziz K, Fong S, Sarasvady S. DBSCAN: Past, present and future. The Fifth International Conference on the Applications of Digital Information and Web Technologies (ICADIWT 2014). 2014. pp. 232–238.
  32. Gowanlock M. Hybrid CPU/GPU clustering in shared memory on the billion point scale. Proceedings of the ACM International Conference on Supercomputing. Phoenix, Arizona: ACM; 2019. pp. 35–45.
  33. Greff K, Srivastava RK, Koutník J, Steunebrink BR, Schmidhuber J. LSTM: A Search Space Odyssey. IEEE Transactions on Neural Networks and Learning Systems. 2017;28: 2222–2232. pmid:27411231
  34. An Q, Tao Z, Xu X, El Mansori M, Chen M. A data-driven model for milling tool remaining useful life prediction with convolutional and stacked LSTM network. Measurement. 2020;154: 107461.
  35. Staudemeyer RC, Morris ER. Understanding LSTM—a tutorial into Long Short-Term Memory Recurrent Neural Networks. arXiv; 2019 Sep. Report No.: arXiv:1909.09586.
  36. Duan S, Zhao H. Attention Is All You Need for Chinese Word Segmentation. arXiv; 2020 Oct. Report No.: arXiv:1910.14537.
  37. https://github.com/YMyyds/Shanxi-Supercomputing-Center-Fault-Data1.
  38. Wang J, Li J, Wang X, Wang T, Sun Q. An air quality prediction model based on CNN-BiNLSTM-attention. Environ Dev Sustain. 2022 [cited 16 Sep 2022].
  39. Townsend JT. Theoretical analysis of an alphabetic confusion matrix. Perception & Psychophysics. 1971;9: 40–50.