Figures
Abstract
Recently, distributed systems have become the backbone of technological development. They serve as the foundation for emerging technologies such as blockchain, the Internet of Things, and others. A distributed system provides fault tolerance and decentralization, so that a fault in one component does not result in whole-system failure. In addition, deep learning models process data to find patterns, which helps in classification, regression, prediction, and clustering. This work employs deep learning to handle faults within distributed systems in three scenarios. Firstly, a faulty processor may not be able to produce the right output. Therefore, a deep learning model uses the inputs and outputs of other processors to find patterns and produce the proper output of the faulty processor. Secondly, if a faulty processor corrupts its inputs as well, then the deep learning model learns from the inputs and outputs of successful processors and produces the proper output of the faulty processor, even with corrupted inputs. Thirdly, for unrelated data, the patterns of the inputs of the faulty processors differ from the patterns of the inputs of successful ones. In this case, the model is able to discover the new pattern and label it as unknown. In the experiments, we use deep learning models such as VGG16, VGG19, AlexNet, LSTM, and ResNet34 to investigate the performance of deep learning in the three mentioned scenarios. For unstructured datasets, the accuracy of the models is affected by the size of the faulty data: the accuracy of all models lies between 60%, when the size of the faulty data is 90%, and 96%, when the size of the faulty data is 10%. The structured datasets are not significantly affected by the portion of faulty data, and the accuracy reaches 99%.
Citation: Assiri B, Sheneamer A (2025) Fault tolerance in distributed systems using deep learning approaches. PLoS ONE 20(1): e0310657. https://doi.org/10.1371/journal.pone.0310657
Editor: Abul Bashar, Prince Mohammad Bin Fahd University, SAUDI ARABIA
Received: March 10, 2024; Accepted: September 4, 2024; Published: January 7, 2025
Copyright: © 2025 Assiri, Sheneamer. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: We use some standardized datasets and they are cited.
Funding: The authors gratefully acknowledge the funding of the Deanship of Graduate Studies and Scientific Research, Jazan University, Saudi Arabia, through Project Number: GSSRD-24.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
In the last decades, parallel and distributed computing has driven the advancement of technology. It serves as the foundation for emerging technologies such as blockchain, cloud storage, the Internet of Things, and others. A distributed system is defined as a collection of components such as processors, storage, communication networks, input tools, output tools, and actuators [1–3]. The components collaborate transparently, so as to appear as a single system, to achieve common goals. Distributed systems provide many advantages, such as decentralization, efficiency, high throughput, scalability, and reliability [1, 4, 5]. They achieve these advantages by distributing the workload over multiple processors and through fault tolerance.
Fault tolerance means that if any processor fails, the system can keep working. In decentralized distributed systems, there is no single point of failure, since processes run on multiple processors connected over a network, and the system can recover from a faulty processor by redistributing its workload to other processors [6, 7]. In fact, distributed systems offer many fault tolerance techniques [8, 9], as listed below:
- Replication: it means to contain multiple copies of data and to store them in multiple places within the system. Therefore, when one copy is corrupted or not accessible because of a faulty processor, other copies are still correct and accessible. Moreover, the corrupt copy of data can be recovered and the system continues to work [9, 10].
- Redundancy: the same process is duplicated and executed by more than one processor for greater reliability and accuracy of the results, which enhances system functionality [9, 11, 12].
- Checkpoints: it enables periodically saving the system state for easy recovery in case of faults or even failure [13].
- System logs: they are used to timely record all events in the system which helps to track back in case of faults, errors, and intrusions.
- Load balancing and scheduling: load balancing distributes tasks according to the processors’ capabilities to avoid processor overloading and failure.
- Consensus protocols: they ensure that the majority of processors agree on some decisions. This helps to identify faulty processors and data [14–16].
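As a concrete illustration of the redundancy and consensus ideas above, the following minimal Python sketch shows majority voting over replicated processor outputs. This is an illustrative snippet, not part of the original system; the helper name is hypothetical.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value reported by a strict majority of processors, or None.

    outputs: one result per processor; faulty processors may report
    arbitrary (corrupted) values, which the majority outvotes.
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None

# Three of four processors agree, so the single faulty result is outvoted.
print(majority_vote([42, 42, 7, 42]))  # -> 42
```

With no strict majority (e.g. two processors disagreeing), the vote returns `None` and the system must fall back on another recovery technique, such as re-execution.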
On the other hand, deep learning models help process large amounts of data to find patterns. This enables deep learning algorithms to perform classification, regression, prediction, and clustering. There are different categories of machine learning models [17, 18], as follows:
- Supervised learning: in which, the model is given some training examples of features that are mapped to the corresponding label. According to these examples, the model is trained to classify or predict the labels of other inputs.
- Unsupervised learning: the model is trained on data and features without labels. The algorithm finds patterns and relationships, then predicts the output for new input. This kind needs more data for training, and it helps reduce the influence of human factors.
- Reinforcement learning: it does not use labeling such as in supervised and unsupervised models. Basically, it takes actions and analyses the results of these actions. It learns from linking environment’s pre-condition, action and environment’s post-conditions. It is popularly useful in games and robotics.
- Deep learning: a subset of machine learning and Artificial Intelligence (AI). It can be supervised, unsupervised, semi-supervised, self-supervised, or reinforcement-based. It uses neural networks with more than three layers to cluster similar inputs and make decisions. It is widely used in image recognition and natural language processing [19]. Deep learning is used in this work.
This work uses deep learning models to handle faults within distributed systems. A large task can be divided into many sub-tasks, which are distributed among multiple processors. A faulty processor may corrupt data, may have processing issues, or may not be able to communicate properly. Therefore, a deep learning model is trained on the inputs (other sub-tasks) and the outputs of other processors to find patterns and to predict the proper output of the faulty processors. There are two kinds of input data: structured and unstructured. Structured data includes data sets, databases, forms, and others, while unstructured data can be images, texts, voice, and others.
This work proposes a fault tolerance technique for distributed systems using deep learning models, in which the input is divided into sub-tasks and distributed among processors; some processors process their data successfully, whereas others do not. The faulty processors may not provide the right output or may not provide any output. Consequently, there are three scenarios, as follows:
- Safe input and corrupted output.
- Corrupted input and output.
- Safe input (but unrelated) and corrupted output.
Firstly, in the case of safe input and corrupted output, the sub-tasks of the faulty processors can be re-sent to other processors to get the right outputs, which is the traditional solution. Alternatively, we can use a deep learning model to learn from the inputs and outputs of successful processors, find patterns, and produce the proper output for the faulty processors. Secondly, if the faulty processors corrupt their inputs as well, then the traditional solution is not possible, and the deep learning model learns from the inputs and outputs of successful processors to find patterns and produce the proper output of the faulty processors, even with corrupted inputs. Thirdly, the case of safe but unrelated input and corrupted output happens when the sub-tasks (input) of the faulty processors differ from the sub-tasks of successful ones. This means the patterns of the faulty processors’ sub-tasks differ from the patterns of the successful ones, which challenges the learning process and the accuracy of prediction. In this case, the deep learning model is able to discover the new patterns and label them as unknown. The unknown label means that this data has its own label that does not exist in the training data.
This paper uses deep learning models such as VGG16, VGG19, AlexNet, LSTM, and ResNet34 to investigate the performance of deep learning in the three mentioned scenarios. Each scenario is examined using both structured and unstructured data. The deep learning models used are described below:
- VGG16: a deep convolutional neural network from the Visual Geometry Group with 16 layers [20].
- VGG19: a deep convolutional neural network from the Visual Geometry Group with 19 layers [20].
- AlexNet: is a deep convolutional neural network that has 8 layers [21].
- LSTM: Long Short-Term Memory is a recurrent neural network model with 3 to 4 layers. Its feedback connections enable it to learn long-term dependencies [22].
- ResNet34: Residual Neural Network is a deep convolutional neural network that has 34 layers [23].
The experimental results show the accuracy of the mentioned deep learning models in the three scenarios using both structured and unstructured data. For unstructured data, the results show that the accuracy of the models is affected by the size of the faulty data: the accuracy of all models lies between 60%, when the size of the faulty data is 90%, and 96%, when the size of the faulty data is 10%. The structured data is not significantly affected by the portion of faulty data, and the accuracy reaches 99% for some models. The rest of this article is organized as follows: Section 2 discusses the related work. Section 3 explains the methodology, Section 4 describes the evaluation, and Section 5 presents the experimental results. Section 6 discusses the limitations and threats to validity. Finally, Section 7 concludes the paper.
2 Related work
In distributed systems, the workload is distributed over multiple processors that run in parallel [24]. The advantages of using a distributed system are improved performance and increased throughput. However, distributed systems are challenged by many issues, such as dependencies among tasks, communication cost, latency, and redundancy [25]. Therefore, distributed systems use techniques such as load balancing, leader election, consensus, and fault tolerance to overcome these challenges [4, 26]. Traditional distributed system techniques do not use intelligent tools such as deep learning, whereas our work focuses on fault tolerance techniques using deep learning.
Researchers have investigated distributed systems’ scalability, resilience, communication issues, and malicious attacks [27]. Others study the types of distributed system faults and failures, then present a mapping strategy to find suitable fault tolerance techniques for each kind of fault [28, 29]. In contrast, our work reconstructs missing and faulty outputs from the correct ones.
Many works focus on fault tolerance and machine learning [30–33]. A fault tolerance framework has been provided for iterative-convergent machine learning, where minor calculation errors influence the training process. It applies fault tolerance to calculations at checkpoints within the training process, which reduces failure effects by 78% to 95% [30, 34]. These techniques are suitable for numeric data, while our work uses deep learning, which is more suitable for different kinds of data. Other research applies a fault tolerance protocol to cloud computing using a Naïve Bayes classifier to enhance reliability [35]. Another work also applies fault tolerance to cloud computing using four machine learning algorithms for job loading and failures. Support vector machine, K-nearest neighbors, logistic regression, and decision tree are used, and the accuracy of all classifiers lies between 59% and 61% [36, 37]. In addition, researchers have reviewed the intelligent fault tolerance concept, which uses machine and deep learning algorithms for fault discovery and recovery [38].
However, our work uses deep learning models to achieve higher accuracy and to deal with other scenarios such as missing, corrupted, and unrelated input.
Moreover, the first neural network model started as a one-layer model in 1958 [39, 40]. After that, a multi-layer neural network model was introduced using the backpropagation algorithm [41, 42], which enables training and classification. Increasing the number of layers improves the accuracy of the learning model. Deep learning is a multi-layer neural network approach where the number of layers is three or more. Nowadays, deep learning is widely used in the recognition and detection of images, visual objects, and speech [43, 44]. For example, researchers have investigated hardware computation issues in two-dimensional array computation using a deep learning model. They apply the fault tolerance concept to reprocess only some of the faulty computations instead of all faults, which is enough to improve computation accuracy to the targeted threshold [45]. Moreover, the reliability of deep learning results is a critical issue in applications such as auto-driving, auto-landing, and robotics [46–48]. Therefore, fault-tolerant deep learning models have been investigated with consideration of different deep learning hierarchies, architectures, and layers [46]. Another work studies the performance of deep learning when it is implemented and run on a Message Passing Interface, an interface designed to support parallel computing and consensus. The study strongly highlights the need for a fault-tolerant infrastructure to simultaneously maintain parallelism and deep learning accuracy [49]. On the other hand, our work uses deep learning models to process different kinds of data, to achieve higher accuracy, and to deal with other scenarios such as missing, corrupted, and unrelated input.
3 Methodology
This work applies deep learning techniques to address faults within distributed systems. As illustrated in Fig 1(a), the initial input is divided into multiple sub-tasks, which are then distributed across several processors. When a processor fails, it may either generate corrupted output or fail to produce any output at all. To mitigate this, the deep learning model leverages the inputs (other sub-tasks) and outputs from the functioning processors to identify patterns and generate the correct output, as depicted in Fig 1(b). Following the training process, as shown in Fig 1(c), the deep learning model takes the input intended for the faulty processor and predicts the correct output, effectively compensating for the fault.
Additionally, the proposed methodology evaluates the effectiveness of our model across three different scenarios. In the first scenario, where the input is correct but the output is faulty, rather than re-processing the sub-tasks of the faulty processors using other processors, we employ a deep learning model. This model learns from the inputs and outputs of the successful processors, identifies patterns, and then generates the correct output for the faulty processors, as previously illustrated in Fig 1. Secondly, when faulty processors also corrupt their inputs, the traditional approach of reprocessing sub-tasks becomes ineffective. In this scenario, our deep learning model learns from the inputs and outputs of successful processors to identify patterns and predict the correct output for the faulty processors, even when working with corrupted inputs. To simulate input corruption, we randomly remove approximately 10% of the input data, a scenario we refer to as “missing.” Examples of missing data are illustrated in Figs 2 and 3. Fig 2 shows missing parts in images from a human gait dataset, while Fig 3 depicts missing sections in images representing different driver statuses (e.g., drinking, talking to a passenger, texting, etc.) in a driver distraction dataset. Additionally, we removed portions of feature values in structured data, specifically from the JavaScript vulnerability and KDD CUP 99 datasets.
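The input-corruption step described above can be sketched as follows. This is an illustrative NumPy snippet under our own assumptions, not the authors’ exact preprocessing code; the helper name `corrupt_input` is hypothetical.

```python
import numpy as np

def corrupt_input(image, missing_fraction=0.10, seed=0):
    """Simulate a faulty processor's corrupted input by zeroing out
    approximately `missing_fraction` of the pixels in an image array."""
    rng = np.random.default_rng(seed)
    corrupted = image.copy()
    # Per-pixel mask over height x width; zeroes the pixel in all channels.
    mask = rng.random(image.shape[:2]) < missing_fraction
    corrupted[mask] = 0
    return corrupted
```

For structured data, the analogous operation would zero out (or drop) a random subset of feature values per record rather than pixels.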
Thirdly, when the input is correct but the output is unrelated or corrupted, this typically occurs when the sub-tasks (input) of faulty processors differ significantly from those of the successful processors. This discrepancy means that the patterns within the faulty processors’ sub-tasks are not aligned with those of the successful ones, complicating the learning process and challenging the accuracy of the deep learning model. In such cases, the deep learning model can detect these new patterns and categorize them under a new label termed as unknown. The unknown label indicates that the data possesses characteristics not present in the training set.
To handle this scenario, we employed an autoencoder deep learning architecture [52, 53] to identify and label the “unknown” class. Autoencoders are particularly useful for detecting anomalies or outliers. We trained the autoencoder on known, or “inlier,” data to establish a baseline for expected reconstruction error. When a new observation is introduced, it is processed through the autoencoder, which computes its reconstruction error. If this error significantly deviates from the expected range for inliers, exceeding a predefined threshold, the observation is classified as belonging to the unknown class.
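A minimal sketch of the reconstruction-error thresholding described above, assuming `autoencoder` is any trained encode-decode callable. The names and the mean-plus-k-sigma threshold heuristic are illustrative assumptions, not the authors’ exact implementation.

```python
import numpy as np

def reconstruction_error(x, autoencoder):
    """Mean squared error between a sample and its reconstruction."""
    return float(np.mean((x - autoencoder(x)) ** 2))

def fit_threshold(inliers, autoencoder, k=3.0):
    """Derive a threshold from inlier errors (mean + k * std, a common heuristic)."""
    errors = [reconstruction_error(x, autoencoder) for x in inliers]
    return float(np.mean(errors) + k * np.std(errors))

def classify(x, autoencoder, threshold):
    """Label a sample 'unknown' when its reconstruction error exceeds the threshold."""
    return "unknown" if reconstruction_error(x, autoencoder) > threshold else "known"
```

In use, the autoencoder is trained only on inlier (known-class) data, so samples from an unseen class reconstruct poorly and exceed the threshold.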
This paper employs deep learning models, including VGG16, VGG19, AlexNet, and ResNet34, to evaluate their performance across the three aforementioned scenarios using image datasets. Additionally, the VGG16, VGG19, AlexNet, and LSTM models are utilized to test these scenarios on two textual datasets. Indeed, LSTM is chosen for structured data due to its sequential nature, which makes it particularly well-suited for handling structured features.
4 Evaluation
4.1 Dataset description
There are two kinds of input data which are structured and unstructured data. The structured data includes data sets, databases, forms, and others, while unstructured data can be images, texts, voice, and others. In this work, we evaluate our approach using four publicly available datasets, where two datasets are based on images which are distracted driver and Gait human datasets (unstructured datasets), and the other two datasets are based on traditional features such as JavaScript Vulnerability and Intrusion detection system (KDD Cup 99) datasets (structured datasets). The details of the datasets are given in Table 1.
The first dataset we used in this experiment is the distracted driver dataset [51, 54], which contains 17,309 frames distributed over the following classes: Safe Driving (3,686), Phone Right (1,223), Phone Left (1,361), Text Right (1,974), Text Left (1,301), Adjusting Radio (1,220), Drinking (1,612), Hair or Makeup (1,202), Reaching Behind (1,159), and Talking to Passenger (2,570).
The second dataset we used is the Gait Recognition dataset (CASIA-A), created by Wang et al. [50]. CASIA-A has 19,135 silhouettes with different walking positions. It consists of twenty classes, each with a different subject, different images, and different walking angles.
The third dataset we used is the JavaScript vulnerability dataset (Ferenc et al.’s dataset [55]) from the Node Security Project. It contains 12,125 functions in two classes: 1,496 vulnerable functions and 10,629 non-vulnerable functions.
The fourth dataset used in this experiment is the KDD CUP 99 dataset [56], which contains 67,343 normal samples and four types of attacks: denial of service (DoS) with 45,927 samples, remote-to-local (R2L) with 995 samples, user-to-root (U2R) with 52 samples, and probing (Probe) with 11,656 samples. Each record contains 41 features [57]; the description of each feature is given in [58–60].
4.2 Performance measurements
Our deep learning models process the mentioned datasets to evaluate our assumptions in the three scenarios. Classification accuracy is commonly used to assess the performance of deep learning models. A variety of performance measurements are reported to view different aspects of the results and gain deeper insight. The main measurement is accuracy; precision, recall, F1-score, and AUC are also presented to evaluate our results. In addition, a loss function is used with the structured datasets. We also compare prediction accuracy and failures using confusion matrices. In the following equations, TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
Accuracy and F1-score are the primary performance measures for all of the classifier models used in our research. Accuracy is the fraction of correct predictions over all input samples. Recall, or true positive rate (TPR), is the number of correct positive results divided by the number of actual positive samples. Precision is defined as TP/(TP + FP), where TP is the number of true positives (correct predictions of positive samples) and FP is the number of false positives. The F1-score is the harmonic mean of precision and recall. The results of all equations lie between 0 and 1.
In formula 5, H refers to the cross-entropy loss function, where p(x) is the true distribution and q(x) is the estimated distribution, defined over the discrete variable x.
Accuracy = (TP + TN) / (TP + TN + FP + FN) (1)
Precision = TP / (TP + FP) (2)
Recall = TP / (TP + FN) (3)
F1-score = 2 × (Precision × Recall) / (Precision + Recall) (4)
H(p, q) = −Σx p(x) log q(x) (5)
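The measurements above can be computed directly from the confusion-matrix counts. The following sketch is an illustrative implementation of the standard formulas, not code from the paper; the function names are our own.

```python
import math

def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = -sum_x p(x) * log(q(x)) over a discrete variable x."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)
```

For example, a classifier with 8 true positives, 90 true negatives, 2 false positives, and no false negatives has accuracy 0.98, precision 0.8, and recall 1.0.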
5 Experimental results
This study aims to investigate the impact of dataset types and sizes on classification performance and to recommend appropriate models for limited-size datasets to handle faults within distributed systems. The experiments examine the fault scenarios as follows:
- In the case of safe input and corrupted output, the dataset is divided into two groups: one represents the non-faulty data, which is used as the training set, and the other is the faulty data, which is used as the testing set. We examine different sizes of faulty data, from 10% to 90% of the whole data.
- In the case of corrupted input and corrupted output, part of the inputs (the testing set) is missing, as explained earlier. We examine different sizes of faulty data, from 10% to 90% of the whole data. Our goal is to build a deep learning model that can identify corrupted images. We are also able to clean the dataset and apply artifact cleaning. We create random distortions on correct images to generate a dataset big enough for our experiments. We then build deep learning models based on Convolutional Neural Networks (CNNs) to identify and classify the corrupted images.
- In the case of unrelated input and corrupted output, the unrelated input is classified as unknown. This indicates that the input does not fit under the classes of the training phase. We conduct the training and then run the classification on the testing set (faulty data). After that, we detect the outliers and classify them as unrelated (unknown), instead of forcing them into an unsuitable class.
- The experiments examine one more scenario that combines both scenarios in points two and three, where the input is corrupted and unrelated at the same time.
For each mentioned scenario we test different sizes of faults, where the percentages of the faulty data can be 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80% or 90% of the whole data. We examine the impact of reducing the size of the training set on the classification performance. After pre-processing the datasets, deep learning models were trained on all datasets. The performance of the classifiers is evaluated with respect to accuracy, precision, recall, specificity, f-score, and AUC.
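The fault-size sweep described above can be sketched as a simple hold-out split, where the faulty fraction serves as the test set and the remainder as the training set. This is an illustrative snippet with hypothetical names, not the authors’ experiment code.

```python
import random

def split_by_fault_ratio(samples, fault_ratio, seed=0):
    """Hold out `fault_ratio` of the data as the faulty test set;
    the remainder is the non-faulty training set."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * fault_ratio)
    return shuffled[cut:], shuffled[:cut]  # (training set, faulty test set)

# Sweep the fault ratio from 10% to 90%, as in the experiments.
for ratio in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
    train, test = split_by_fault_ratio(range(100), ratio)
```

Note that as the fault ratio grows, the training set shrinks, which is what drives the accuracy drop reported for the unstructured datasets.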
Four deep learning models, namely, VGG16, VGG19, AlexNet, and ResNet34 are used, to study the performance of the deep learning models in the mentioned scenarios and on different fault sizes using two different images datasets. In addition, VGG16, VGG19, AlexNet, and LSTM deep learning models are used to test the first two mentioned scenarios with different fault sizes on the other two textual datasets.
5.1 Unstructured datasets
In this part, we run the four mentioned scenarios using the four deep learning models, namely VGG16, VGG19, AlexNet, and ResNet34, with different sizes of faulty results, on the two unstructured datasets. For the first dataset, the distracted driver dataset, the accuracy is illustrated in Fig 4. The figure shows that for VGG16, the accuracy of the first scenario (when the input is safe) reaches 90% when the size of the faulty data (testing set) is 10% and the accurate data (used as the training set) is 90%. By increasing the size of the faulty data and decreasing the size of the training set, the accuracy drops gradually until it reaches 69%, when the faulty data is 90%. The second scenario deals with corrupted input through missing data. The VGG16 accuracy reaches 92% when the size of the faulty data (testing set) is 10% and the accurate data (used as the training set) is 90%. By increasing the size of the faulty data and decreasing the size of the training set, the accuracy drops gradually until it reaches 67%, when the faulty data is 90%. The third scenario has unrelated input that is classified as unknown, in which the VGG16 accuracy reaches 89% when the faulty data is 10%, then decreases to 66% when the size of the faulty data is 90%. The fourth scenario uses corrupted (missing) and unrelated input, where the VGG16 accuracy reaches 84% when the faulty data is 10%, then decreases to 64% when the size of the faulty data is 90%.
The same applies to VGG19, AlexNet, and ResNet34. In general, the accuracy of the first scenario is the best in most tests, since the input (testing set) is safe. The second scenario reaches almost the same accuracy as the first one, even with some missing input in the testing set. The accuracy of the deep learning models in the third and fourth scenarios is slightly lower than in the first scenario; however, all scenarios are very close and still acceptable.
Table 2 gives more insight into the performance of the four deep learning models by presenting supportive details about precision, recall, specificity, F1-score, and AUC.
For the second dataset, the Gait Recognition dataset, the accuracy is shown in Fig 5. The figure shows that for VGG16, the accuracy of the first scenario reaches 97% when the size of the faulty data is 10% and the training set is 90%. It then decreases to 77% when the faulty data is 90%. The second scenario’s accuracy reaches 95% when the size of the faulty data is 10%, and drops to 71% when the size of the faulty data is 90%. In the third scenario, the VGG16 accuracy reaches 95% when the faulty data is 10%, then decreases to 71% when the size of the faulty data is 90%. In the fourth scenario, the VGG16 accuracy reaches 94% when the faulty data is 10%, then decreases to 66% when the size of the faulty data is 90%.
For VGG19, the accuracy of all scenarios lies between 92% and 95% when the size of the faulty data is 10%, then decreases to between 68% and 72% as the faulty data size increases to 90%. Almost similar results apply to AlexNet. However, ResNet34 has lower accuracy, dropping to about 56% when the faulty data size becomes 90%.
Table 3 gives more insight into the performance of the four deep learning models by presenting supportive details about precision, recall, specificity, F1-score, and AUC.
5.2 Structured datasets
For further verification, the vulnerability dataset is used to conduct the same scenarios using the VGG16, VGG19, AlexNet, and LSTM deep learning models, with different portions of fault size. Fig 6 demonstrates the vulnerability dataset accuracy results using the four deep learning models. In this part, we focus only on the first two scenarios. The accuracy of both scenarios is almost the same across all models. Indeed, VGG16, VGG19, and AlexNet outperform LSTM. In addition, Fig 7 shows the best vulnerability dataset results based on the loss function, where the loss function measures how well the deep learning model matches the expected output. The loss function is therefore used with the structured datasets, since the expected output is present. The experiment shows that the loss function decreases as the faulty data size decreases and the number of epochs increases, for all models. Moreover, Fig 8 illustrates the best missing vulnerability dataset results based on the loss function, where the loss function decreases as the faulty data size decreases and the number of epochs increases, for all models.
Furthermore, the KDD Cup99 dataset is used to conduct the same tests using the VGG16, VGG19, AlexNet, and LSTM deep learning models with different portions of fault size. Fig 9 demonstrates the KDD Cup99 dataset accuracy results using the four deep learning models. As mentioned earlier, this part focuses only on the first two scenarios. The accuracy of both scenarios is almost the same for all models. Indeed, VGG16, VGG19, and AlexNet outperform LSTM under all sizes of faulty data. In addition, Fig 10 shows the best KDD Cup99 dataset results based on the loss function, where the loss function decreases as the faulty data size decreases and the number of epochs increases, for all models. Moreover, Fig 11 illustrates the best missing intrusion dataset results based on the loss function, where the loss function decreases as the faulty data size decreases and the number of epochs increases, for all models.
6 Limitations and threats to validity
Now, it is important to illustrate the challenges and threats to the validity of the proposed model, as follows:
- Fault Tolerance Scope: Employing intelligent techniques such as deep learning techniques in the distributed system and fault recovery is an innovative task, without deep history or approved benchmark. Actually, deep learning models are effective in managing certain types of faults within distributed systems. However, their approach may not cover all potential fault scenarios, especially those involving complex, interconnected faults that can propagate across various system components.
- Dataset Dependence: Finding an existing suitable dataset to test the proposed idea is another challenge. In fact, the models’ performance relies heavily on the quality and nature of the datasets they are trained on. For example, models trained on specific structured or unstructured data types might not perform optimally when applied to different data types. Therefore, we train our models on both structured and unstructured datasets.
- Fault Types: Examining different kinds of faults is another issue, where we investigate different kinds of faults, using three scenarios.
- Fault Ratio: Fault ratios are another critical issue to investigate. Our work extends the experiments to test different fault ratios. Moreover, in cases involving larger faults, the recovery time could be considerable.
- Computational Overhead: Implementing deep learning models for real-time fault detection and correction introduces computational overhead, which could be problematic for systems with stringent latency requirements. For example, the time taken for a model to make predictions during the fault recovery process can also be a bottleneck, especially in real-time systems where rapid decision-making is essential.
- Training Time: Training deep learning models, particularly with large datasets or complex architectures like VGG16, VGG19, or ResNet34, is time-intensive. This significant time investment must be taken into account when deploying these models in practical applications.
- Generalization: Additionally, selecting suitable deep learning techniques and generalizing our findings are vital points to consider.
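The computational-overhead concern above can be quantified by timing individual predictions. The helper below is a hypothetical sketch; `measure_latency` and the stand-in model are ours, not part of the paper's evaluation:

```python
import time

def measure_latency(model_fn, inputs, warmup=10, runs=100):
    """Average per-prediction latency of `model_fn` in seconds.
    `model_fn` stands in for a deployed model's predict call."""
    for x in inputs[:warmup]:          # warm-up to exclude one-off setup costs
        model_fn(x)
    start = time.perf_counter()
    for i in range(runs):
        model_fn(inputs[i % len(inputs)])
    return (time.perf_counter() - start) / runs

# A trivial stand-in "model"; in practice this would wrap model.predict(x).
avg = measure_latency(lambda x: x * 2, list(range(20)))
```

Comparing this average against the system's latency budget indicates whether model-based recovery is feasible for a given real-time deployment.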
7 Conclusion
This work leverages the strengths of deep learning, particularly in pattern recognition and prediction, to address errors in distributed systems across three scenarios:
- Processor Malfunction: When a processor fails to generate the correct output, deep learning models can use the inputs and outputs from other functioning processors to identify patterns and reconstruct the correct output of the malfunctioning processor.
- Corrupted Inputs: The deep learning model learns from the inputs and outputs of successful processors to detect patterns and accurately predict the correct output of faulty processors, even when their inputs are corrupted.
- Distinct Input Patterns: In cases where the input patterns of unsuccessful processors differ significantly from those of successful ones, the deep learning model can identify these as new patterns, categorizing them as Unknown.
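The first scenario can be sketched with a toy reconstructor. This is an illustrative stand-in under simplifying assumptions: a linear model trained by gradient descent replaces the deep architectures, and `train_reconstructor` and the toy data are ours, not taken from the experiments:

```python
# Scenario 1 sketch: learn the mapping from healthy processors' data to the
# output the faulty processor should have produced. A tiny linear model
# stands in for deep architectures such as VGG16 or LSTM.
def train_reconstructor(inputs, targets, lr=0.05, epochs=1000):
    n = len(inputs[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in zip(inputs, targets):
            # Stochastic gradient step on squared error.
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b

# Healthy processors report (x1, x2); the faulty one should output x1 + 2*x2.
X = [[1, 0], [0, 1], [1, 1], [2, 1]]
y = [1, 2, 3, 4]
predict = train_reconstructor(X, y)
```

Once trained, `predict` substitutes for the faulty processor: given fresh inputs from the healthy processors, it reconstructs the output the faulty one should have produced.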
We employ deep learning architectures such as VGG16, VGG19, AlexNet, LSTM, and ResNet34 to evaluate the performance of deep learning in these scenarios. The analysis includes both structured and unstructured data, demonstrating that our models achieve high detection accuracy when compared to non-faulty processing and data.