Abstract
Against the backdrop of increasingly mature intelligent driving assistance systems, effective monitoring of driver alertness during long-distance driving becomes especially crucial. This study introduces a novel method for driver fatigue detection aimed at enhancing the safety and reliability of intelligent driving assistance systems. The core of this method lies in the integration of advanced facial recognition technology using deep convolutional neural networks (CNN), particularly suited for varying lighting conditions in real-world scenarios, significantly improving the robustness of fatigue detection. Innovatively, the method incorporates emotion state analysis, providing a multi-dimensional perspective for assessing driver fatigue. It adeptly identifies subtle signs of fatigue in rapidly changing lighting and other complex environmental conditions, thereby strengthening traditional facial recognition techniques. Validation on two independent experimental datasets, specifically the Yawn and YawDDR datasets, reveals that our proposed method achieves a higher detection accuracy, with an impressive 95.3% on the YawDDR dataset, compared to 90.1% without the implementation of Algorithm 2. Additionally, our analysis highlights the method’s adaptability to varying brightness levels, improving detection accuracy by up to 0.05% in optimal lighting conditions. Such results underscore the effectiveness of our advanced data preprocessing and dynamic brightness adaptation techniques in enhancing the accuracy and computational efficiency of fatigue detection systems. These achievements not only showcase the potential application of advanced facial recognition technology combined with emotional analysis in autonomous driving systems but also pave new avenues for enhancing road safety and driver welfare.
Citation: Lin N, Zuo Y (2024) Advancing driver fatigue detection in diverse lighting conditions for assisted driving vehicles with enhanced facial recognition technologies. PLoS ONE 19(7): e0304669. https://doi.org/10.1371/journal.pone.0304669
Editor: Essam Debie, University of New South Wales, AUSTRALIA
Received: December 12, 2023; Accepted: May 15, 2024; Published: July 10, 2024
Copyright: © 2024 Lin, Zuo. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The Yawn Dataset by David Hiram Vazquez Santana (2021) can be accessed by registering for any Kaggle account through the following link: https://www.kaggle.com/datasets/davidvazquezcic/yawn-dataset/data. The YawDD: Yawning Detection Dataset by Shabnam Abtahi et al. (2020) can be accessed by registering for any IEEE account via the link: https://dx.doi.org/10.21227/e1qm-hb90. Both datasets are open source, and we did not have any special access privileges that others would not have.
Funding: Project 1: Research and implementation of a C-language mobile learning platform based on cloud technology (Project No. 2021KY1805). Project 2: Research on the digital protection method of Chinese ancient architecture based on BIM (Project No. 2019JSGC14).
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
1.1 Background
In the advancement of intelligent driving assistance systems, monitoring driver fatigue has emerged as a crucial technological and ethical challenge to ensure road safety. This technology’s evolution not only safeguards driving safety but also profoundly embodies the respect for human life and dignity. Through ongoing scientific research, our comprehension of fatigue driving’s intricate nature has significantly expanded, particularly under rapidly changing lighting conditions, such as when drivers encounter varying levels of brightness (e.g., entering tunnels or when surrounded by diverse structures like buildings and overpasses). In these contexts, the yawning behavior of drivers, a key indicator of fatigue, introduces an augmented risk factor. Such sudden lighting changes can impair a driver’s visual adaptability and focus, thus elevating the risk of accidents. This scenario underscores the critical role of fatigue monitoring technology in promoting driving safety and highlights the challenges in effective monitoring within specific environmental settings.
Firstly, a study by Azam et al. (2014) highlighted that fatigue-related traffic accidents account for 10% on regular roads and 28% on highways, underscoring the varying impact of driver fatigue across different road types and setting the stage for targeted fatigue monitoring under diverse lighting conditions [1]. Subsequently, Liu et al. (2016) introduced a real-time fatigue detection method utilizing extreme learning machines, marking a significant technological advance and providing a crucial reference for adapting fatigue monitoring to changes in lighting [2]. The significance of in-vehicle warning systems, especially in scenarios marked by significant lighting shifts, was further underlined by Richardson (2019) [3]. Research by Vasile Plămădeală et al. [4] presented that 75% of fatal accidents are attributed to human factors such as fatigue, particularly in conditions of night driving or sudden lighting changes, thereby significantly heightening the risk of fatigue-induced accidents. Lastly, Davidović et al. (2020) found that fatigue driving contributes to 26% of all traffic accidents, reinforcing the critical need for fatigue monitoring technologies to bolster road safety amidst variable lighting environments [5].
This body of work does more than aggregate data and developments: it serves as a crucial reminder of the collective responsibility drivers share for the safety of all road users. It stresses the importance of leveraging technological advances in fatigue detection to foster safer driving environments under diverse and challenging conditions.
1.2 Literature review
In the field of computer vision technology for driver fatigue detection, researchers have proposed various innovative methods in recent years (Kim et al. (2021) and Poulose et al. (2021)) [6, 7]. Zhang et al. (2015) used a fast and robust facial detection algorithm and Boost-LBP features for driver fatigue facial expression recognition, achieving significant results [8]. However, their method lacked robustness under complex lighting conditions, limiting its practical application range. Tao et al. (2017) introduced a method that aligned and normalized facial sequences to extract features related to fatigue expressions and used a sliding window for fatigue detection [9]. Despite improvements in feature extraction, this method’s computational efficiency in processing real-time video data still needed enhancement. Jia et al. (2021) proposed a fatigue detection method based on CNN-HMM, detecting the driver’s eyes, mouth, and head posture with an accuracy of 97.5% [10]. However, this method did not fully consider the impact of the driver’s emotional state on fatigue detection, an important dimension of fatigue detection.
Sacco et al. (2012) demonstrated a real-time non-invasive fatigue monitoring system using facial expressions, achieving an accuracy of 95.2% [11]. However, this system had limited adaptability to facial occlusions and varied expressions, potentially limiting its effectiveness under actual road conditions. Khan et al. (2014) proposed a comprehensive vision-based method to detect driver fatigue, achieving an average accuracy of 97.7% [12]. However, the adaptability of this method in complex environments, particularly under changing lighting and weather conditions, had not been fully verified. Qunzhu et al. (2019) developed an improved random forest cascade regression algorithm for detecting facial feature points in driver fatigue detection [13]. Although the method performed well in feature point detection, there was room for improvement in recognizing complex facial expressions and subtle fatigue signals. You et al. (2020) described a real-time driver fatigue detection algorithm based on facial motion entropy, achieving an accuracy of 94.32% [14]. Despite its high accuracy, the method’s computational efficiency and resource consumption in processing large volumes of real-time video data remained a challenge. Dong et al. (2021) proposed a method to detect driver fatigue and distraction, improving accuracy and computation time [15]. However, the universality and scalability of this method across different drivers and vehicle types had not been fully validated. Dong et al. (2022) discussed a method using random forests and convolutional neural networks to detect driver fatigue and distraction behaviors [16]. This research excelled in handling complex behavior analysis but still needed further study for real-time processing and low resource consumption.
Although existing research has made significant technological advancements, there are still deficiencies in handling facial recognition and emotion analysis in complex driving environments under changing lighting conditions (as shown in Table 1). In response to this challenge, we propose a comprehensive fatigue detection method combining advanced facial recognition technology and emotion state analysis, with a particular emphasis on accuracy and adaptability under drastic changes in lighting conditions.
1.3 Our contribution
This study is committed to breaking through the limitations of existing technologies in the field of driver fatigue monitoring, with a special focus on fatigue detection under changing lighting conditions. Our research contributions are mainly reflected in the following aspects:
- Development of an advanced data preprocessing method based on deep learning technology, capable of extracting facial features with unprecedented accuracy under changing lighting conditions. This method not only enhances the accuracy of feature extraction but also strengthens the system’s stability under different lighting intensities and partial facial occlusions.
- Introduction of an innovative dynamic keyframe extraction algorithm. Unlike traditional methods, our algorithm intelligently selects keyframes based on the dynamic changes in video content under changing lighting conditions. This approach significantly improves the efficiency of video data processing under varying lighting conditions, reducing the demand for computational resources and enabling real-time or near-real-time fatigue detection.
- Design of an innovative composite action recognition network, combining multiple neural network technologies such as 3D convolutional networks and long short-term memory networks (LSTM), to enhance the recognition of subtle facial movements in continuous video frames under different lighting conditions. This network not only captures dynamic changes over time but also processes long-term temporal dependencies, significantly improving the accuracy of detecting minor facial movements related to fatigue driving under changing lighting conditions.
This paper is structured as follows: Section 1 introduces the study’s background and reviews relevant literature, identifying gaps our work aims to fill. In Section 2, we detail our proposed methodology for detecting driver fatigue, including data preprocessing, dynamic keyframe extraction, and the composite action recognition network. Section 3 shows the Algorithm Pseudocode. Section 4 presents experimental setups, datasets, and results, demonstrating the effectiveness of our approach. Finally, Section 5 concludes the paper, summarizing our findings and suggesting avenues for future research.
2 Our approach
2.1 Problem description
Problem 1. In intelligent driving assistance systems, accurately identifying driver fatigue from facial behaviors (such as yawning, blinking, etc.) is challenging. Facial behaviors are critical indicators of fatigue states as they are usually involuntary and can be effectively captured by automated systems. We define the detection of driver fatigue states as a time series analysis problem, where each time point’s facial state can be represented by a multidimensional feature vector Xt. Our goal is to design an algorithm that can accurately recognize fatigue patterns within these feature vectors.
Specifically, we need to solve the following mathematical problem:
(1) F(Xt) = Σ_{i=1}^{N} wi · g(Xt−i; θ) + b, where F(Xt) represents the fatigue detection function at time t, N is the size of the considered time window, wi are weight parameters, g is a nonlinear function (e.g., a convolutional or recurrent neural network) used to extract relevant information from the facial feature vector Xt−i, θ represents the parameters of the function g, and b is a bias term.
We also need to consider the impact of environmental factors on fatigue detection, which can be expressed as:
(2) Et = α · Lt + β · Ct, where Et represents environmental factors at time t, Lt is the light intensity, Ct is the noise level inside the vehicle, and α and β are influence coefficients.
Therefore, the final fatigue detection model can be expressed as:
(3) Yt = σ(F(Xt) + Et), where Yt is the fatigue state output at time t, and σ is an activation function (such as Sigmoid or Softmax) used to convert the model output into a probability of the fatigue state.
Our goal is to minimize the prediction error, that is:
(4) min_{w, θ, b} (1/T) Σ_{t=1}^{T} (Yt − Ŷt)², where Ŷt is the true fatigue state at time t, and T is the total observation time.
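As a minimal numerical sketch of the pipeline in Eqs (1)–(4): the placeholder map g, the uniform window weights, and all coefficients below are illustrative stand-ins, not the trained components of our system.

```python
import numpy as np

def g(x, theta):
    # Placeholder nonlinear feature map; the paper's g is a CNN/RNN.
    return float(np.tanh(theta * x).mean())

def fatigue_score(X, w, theta, b):
    # Eq (1): weighted sum of per-frame nonlinear responses over a window.
    return sum(w[i] * g(X[-1 - i], theta) for i in range(len(w))) + b

def environment(L_t, C_t, alpha=0.5, beta=0.1):
    # Eq (2): combined influence of light intensity and cabin noise.
    return alpha * L_t + beta * C_t

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fatigue_probability(X, w, theta, b, L_t, C_t):
    # Eq (3): squash the combined score into a fatigue probability.
    return sigmoid(fatigue_score(X, w, theta, b) + environment(L_t, C_t))

rng = np.random.default_rng(0)
X = [rng.normal(size=8) for _ in range(5)]  # 5 frames of 8-dim face features
w = np.full(5, 0.2)                         # uniform window weights
p = fatigue_probability(X, w, theta=1.0, b=0.0, L_t=0.3, C_t=0.2)
print(f"fatigue probability: {p:.3f}")
```

Minimizing the squared error of Eq (4) then amounts to fitting w, θ, and b against labeled fatigue states.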
2.2 Enhanced data preprocessing
2.2.1 Motivation for data preprocessing enhancement.
The advancement in driver fatigue detection is challenged by the complexity of driving environments and the diversity of driver behaviors, necessitating robust data preprocessing methods [17–19]. Traditional facial behavior recognition techniques often falter with complex, blurred, or dynamically changing expressions, influenced by varying lighting and camera angles. To address these issues, we propose an enhanced data preprocessing approach that leverages deep learning-based CNN technology for precise facial feature extraction and introduces emotion state analysis. This dual strategy allows for a more accurate and comprehensive detection of fatigue states, even in environments lacking direct environmental sensing, thereby significantly improving the reliability and robustness of fatigue detection systems.
2.2.2 Mathematical principles.
Under the enhanced data preprocessing framework, we utilize deep learning models to extract facial features and combine them with emotion state analysis to enhance the accuracy of fatigue detection. Deep learning models are chosen for their excellent performance in handling high-dimensional data and capturing complex patterns. This process can be described by the following mathematical models:
In our approach, the extraction of facial feature vectors and the analysis of emotion states are pivotal for enhancing the accuracy and reliability of fatigue detection. Instead of detailing these processes through complex mathematical equations, we simplify our explanation to make our methodology accessible to a wider audience.
The generation of the facial feature vector Vt is accomplished through the use of a CNN model. This process involves inputting a facial image It into the CNN, which is configured with a specific set of parameters (Φ). The CNN employs layers of neurons equipped with weights and biases to apply activation functions, such as the ReLU (Rectified Linear Unit), thereby extracting meaningful features from the facial image that are indicative of fatigue.
Following the feature extraction step, we analyze the emotional state of the driver using a Deep Neural Network (DNN), characterized by its parameters (Ψ). This network processes the extracted facial features Vt and utilizes functions like tanh (hyperbolic tangent) to interpret these features in the context of emotional states, contributing further to our understanding of the driver’s fatigue level.
These two models together constitute our enhanced data preprocessing framework, aiming to improve the accuracy and reliability of fatigue detection through in-depth analysis of facial features and emotion states.
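The two-stage preprocessing above can be sketched as follows; the single-layer stand-ins for Vt = CNN(It; Φ) and Et = DNN(Vt; Ψ), along with all shapes and weights, are illustrative assumptions rather than our actual architectures.

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(x):
    return np.maximum(0.0, x)

def cnn_features(image, W, b):
    # Stand-in for V_t = CNN(I_t; Phi): one linear layer + ReLU over the
    # flattened image; the paper's model is a deep convolutional network.
    return relu(W @ image.ravel() + b)

def emotion_state(V, U, c):
    # Stand-in for E_t = DNN(V_t; Psi): a tanh layer over facial features.
    return np.tanh(U @ V + c)

image = rng.random((16, 16))           # toy grayscale face crop
W, b = rng.normal(size=(32, 256)) * 0.05, np.zeros(32)
U, c = rng.normal(size=(8, 32)) * 0.1, np.zeros(8)

V_t = cnn_features(image, W, b)        # facial feature vector
E_t = emotion_state(V_t, U, c)         # emotion-state vector
print(V_t.shape, E_t.shape)
```

The downstream fatigue classifier then consumes both V_t and E_t, as formalized in Problem 2 below.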
Considering facial features and emotion states together, the fatigue detection problem can be redefined, and a theorem on the accuracy of fatigue detection and a corollary on the importance of data preprocessing can be proposed.
Problem 2. Given a series of time-series facial image frames, we need to design an algorithm to accurately identify the driver’s fatigue state. This problem can be described as optimizing the following mathematical model:
(5) Yt = σ(g(Vt; θ) + b), where Vt = CNN(It; Φ)
With the integration of emotion state analysis, the fatigue detection model is updated as:
(6) Yt = σ(g(Vt, Et; θ) + b), where Et = DNN(Vt; Ψ)
The goal is to minimize the prediction error:
(7) min_{Φ, Ψ, θ, b} (1/T) Σ_{t=1}^{T} (Yt − Ŷt)², where Ŷt is the true fatigue state at time t.
Theorem 1 (Accuracy of Fatigue Detection). For any given continuous video sequence, if there is a time window size N and a sufficiently complex feature extraction network, it is possible to accurately identify the driver’s fatigue state through this network. In particular, if the feature extraction network can maximize the joint information entropy H(Vt, Et) between the facial feature vector Vt and the emotion state Et, the identification of fatigue states will be more accurate.
Proof is provided in the S1 Appendix.
Corollary 1 (Importance of Data Preprocessing). Let M represent the model in the fatigue detection system, D represent the original facial behavior dataset, and P represent the data preprocessing function. Then, effective data preprocessing P(D) can significantly enhance the performance of M on the dataset D.
Proof is provided in the S1 Appendix.
2.3 Dynamic keyframe extraction
2.3.1 Motivation for dynamic keyframe extraction.
Traditional facial behavior analysis and fatigue detection in rapidly changing environments pose significant challenges, particularly in processing high-frequency video data. The limitations of conventional methods become evident as they struggle with transient facial expressions and environmental lighting variations, often missing critical data [20, 21]. This necessitates a more dynamic approach to data processing, in which keyframe extraction does not depend on static intervals but adapts dynamically to changes in video content, especially in lighting. Our proposed dynamic keyframe extraction technique addresses these issues by intelligently identifying and extracting the frames that represent significant behavioral changes, optimizing computational efficiency and accuracy in fatigue detection for real-time or near-real-time applications.
2.3.2 Mathematical principles.
To deeply implement dynamic keyframe extraction, particularly considering rapidly changing lighting conditions, we have designed a series of complex mathematical models integrating calculus, optimization theory, and advanced statistical methods. Our goal is to accurately identify the driver’s fatigue state under various environmental conditions, especially in drastic lighting changes. Firstly, we define a multi-layer image sequence difference measure function Dt considering lighting changes:
(8) Dt = Σ_p αp ‖Vt(p) − Vt−i(p)‖ + Σ_q βq ‖∇(Vt(q) − Vt−i(q))‖ + Σ_r γr ‖∇²(Vt(r) − Vt−i(r))‖
where Vt(p) and Vt−i(p) are the feature vectors of the image frames at times t and t − i at layer p, respectively, and αp, βq, γr are weight coefficients, with ∇² and ∇ representing the second- and first-order derivatives, respectively. This multi-layer difference measurement helps accurately capture subtle changes in the image sequence caused by lighting variations, thus identifying the keyframes with the most dynamic changes.
Secondly, we introduce a highly complex optimization model considering lighting changes for selecting the optimal set of keyframes St:
(9)
where θi, ϕ, ψ, κj, ωk, νl are adjustment factors, and δ is a preset threshold. These adjustment factors allow the model to flexibly adapt during keyframe extraction, addressing various video sequences and environmental changes, including lighting variations.
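The difference-then-threshold logic of the keyframe extractor can be sketched as below; the finite-difference approximations of ∇ and ∇², the weights, and the threshold are illustrative placeholders, not the tuned adjustment factors of Eq (9).

```python
import numpy as np

def difference_measure(frames, alpha=1.0, beta=0.5, gamma=0.25):
    # D_t combining zeroth-, first-, and second-order temporal differences,
    # loosely following the description of Eq (8). Weights are illustrative.
    f = np.asarray(frames, dtype=float)
    d0 = np.linalg.norm(f[-1] - f[-2])
    grad = np.diff(f, axis=0)        # first-order temporal difference
    lap = np.diff(f, n=2, axis=0)    # second-order temporal difference
    return (alpha * d0
            + beta * np.linalg.norm(grad[-1])
            + gamma * np.linalg.norm(lap[-1]))

def select_keyframes(feature_seq, window=3, delta=1.0):
    # Greedy selection: keep frame t whenever D_t exceeds the threshold.
    keyframes = []
    for t in range(window, len(feature_seq)):
        if difference_measure(feature_seq[t - window:t + 1]) > delta:
            keyframes.append(t)
    return keyframes

flat = [np.zeros(4)] * 10                                   # static scene
jumpy = [np.zeros(4)] * 5 + [np.ones(4) * 3] + [np.zeros(4)] * 4
print(select_keyframes(flat), select_keyframes(jumpy))
```

A static sequence yields no keyframes, while an abrupt content change (e.g., a lighting jump) triggers extraction around the change.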
Finally, we redefine our problem and propose a theorem about the effect of environmental factors, especially lighting changes, on fatigue prediction and the effectiveness of keyframe extraction.
Problem 3. Given a series of time-series facial image frames, our task is to optimize a problem containing multi-layered, multi-dimensional feature difference measurements and a highly complex optimization model, to dynamically extract keyframes. This aims to accurately identify the driver’s fatigue state, especially in rapidly changing lighting conditions. The optimization problem can be reformulated as follows:
(10) min Σ_{t=1}^{T} (Yt(St, Et) − Ŷt)²
where Yt(St, Et) is the fatigue state prediction output based on the selected keyframes St, Et represents environmental factors, especially the impact of lighting changes on fatigue prediction, and Ŷt is the actual fatigue state at time t.
Theorem 2 (Influence of Environmental Factors). Environmental factors within the vehicle, such as lighting intensity Lt and noise level Ct, significantly impact the accuracy of fatigue detection, which can be expressed by the following formula:
(11) Yt = σ(ω(Lt, Ct) · F(Xt) + λ(Lt, Ct))
where ω(Lt, Ct) and λ(Lt, Ct) are functions of lighting intensity and noise level, respectively.
Lemma 1 (Effectiveness of Keyframe Extraction). In the fatigue detection of video sequences, the dynamic keyframe extraction method is more effective than the static frame sampling method, mathematically expressed as:
(12) where θi(Lt, Ct), ϕ(Lt, Ct), κj(Lt, Ct) are functions of the time window, lighting intensity, and noise level.
Proof is provided in the S1 Appendix.
2.4 Composite action recognition network
2.4.1 Motivation for composite action recognition network.
Traditional driver fatigue detection methods encounter limitations in capturing the nuanced spatiotemporal dynamics of continuous video frames, crucial for identifying subtle fatigue-related facial movements [22, 23]. Addressing these deficiencies, we introduce the Composite Action Recognition Network, merging the capabilities of 3D convolutional networks (3D CNN) and Long Short-Term Memory networks (LSTM). This integration is designed to process the spatial and temporal aspects of extracted keyframes dynamically, enhancing the accuracy and robustness of fatigue detection under variable lighting conditions.
The use of a 3D CNN in our approach might raise questions about its complexity and computational demand, given the high parameter counts typically associated with 3D models. It is therefore important to clarify that the “3D” aspect in our context refers to processing sequences of 2D images over time, extracted through the dynamic keyframe extraction phase, rather than traditional 3-dimensional video data. This methodological choice enables us to efficiently capture the temporal dynamics and subtle changes in facial expressions with significantly reduced computational overhead, making it a practical solution for real-time fatigue detection applications.
2.4.2 Mathematical principles.
First, for the hybrid neural network architecture in the Composite Action Recognition Network, we constructed the following mathematical model.
Hybrid Neural Network Architecture:
(13)
(14)
(15)
where 3DConv and LSTM represent the 3D convolutional and Long Short-Term Memory layers, respectively, and Θ and Λ are their corresponding network parameter sets. Here, θi, Wij, bi are weights and biases of the 3D convolutional layer, while λk, Ukl, ck are weights and biases of the LSTM layer. Gt integrates the outputs of both network layers through the Tanh activation function. Rt is the final output, using the Sigmoid function for the final classification result. This hybrid architecture design leverages the 3D convolutional layer’s capability to capture spatial features and the LSTM layer’s ability to process temporal sequence data.
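The data flow of Eqs (13)–(15) can be sketched in NumPy as below. This is a heavily simplified stand-in — a real implementation would use `torch.nn.Conv3d` and `torch.nn.LSTM` — and the single strided spatiotemporal kernel, hidden sizes, and weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv3d_features(clip, K, b):
    # Simplified stand-in for the 3D convolutional stage of Eq (13):
    # one spatiotemporal kernel K correlated at stride = kernel size.
    kt, kh, kw = K.shape
    T, H, W = clip.shape
    out = [np.sum(clip[t:t + kt, i:i + kh, j:j + kw] * K) + b
           for t in range(0, T - kt + 1, kt)
           for i in range(0, H - kh + 1, kh)
           for j in range(0, W - kw + 1, kw)]
    return np.tanh(np.array(out))

def lstm_step(x, h, c, P):
    # Minimal LSTM cell for the temporal stage; P packs all gate weights.
    z = P["Wx"] @ x + P["Wh"] @ h + P["b"]
    n = h.size
    i, f, o = sigmoid(z[:n]), sigmoid(z[n:2 * n]), sigmoid(z[2 * n:3 * n])
    g = np.tanh(z[3 * n:])
    c = f * c + i * g
    return o * np.tanh(c), c

def composite_recognition(clips, K, b, P, w_out):
    # H_t -> G_t -> R_t pipeline of Eqs (13)-(15), heavily simplified.
    h = np.zeros(P["Wh"].shape[1])
    c = np.zeros_like(h)
    for clip in clips:
        x = conv3d_features(clip, K, b)  # spatial features (Eq (13))
        h, c = lstm_step(x, h, c, P)     # temporal memory over keyframes
    G = np.tanh(h)                       # fused representation G_t (Eq (14))
    return float(sigmoid(w_out @ G))     # fatigue score R_t (Eq (15))

K = rng.normal(size=(2, 4, 4)) * 0.1     # toy spatiotemporal kernel
P = {"Wx": rng.normal(size=(24, 8)) * 0.1,
     "Wh": rng.normal(size=(24, 6)) * 0.1,
     "b": np.zeros(24)}
w_out = rng.normal(size=6)
clips = [rng.random((4, 8, 8)) for _ in range(3)]  # 3 windows of 4 keyframes
R_t = composite_recognition(clips, K, b=0.0, P=P, w_out=w_out)
print(f"R_t = {R_t:.3f}")
```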
Additional Feature Extraction and Fusion Layers:
(16)
(17)
where Ft and Ct respectively represent additional feature extraction and fusion layers, further enhancing the network’s feature recognition capabilities. ϕi, Aij, fi are weights and biases of the feature extraction layer, while ψi, Bij, gi are weights and biases of the fusion layer.
Based on the above model, we define Problem 4:
Problem 4. The goal of optimizing the hybrid neural network architecture is to maximize the spatiotemporal feature recognition capability of facial actions, while considering the complexity of dynamic keyframe extraction. The optimization problem can be expressed as:
(18) where Rt is the fatigue state prediction output based on the hybrid neural network architecture, Ŷt is the actual fatigue state at time t, and St is the feature vector based on dynamic keyframe extraction. α, β are weighting factors used to balance the optimization of facial action recognition and keyframe extraction.
In this problem, St represents the optimization model for dynamic keyframe extraction, integrating calculus, optimization theory, and advanced statistical methods. Ft and Ct respectively represent the complex mathematical models of the additional feature extraction and fusion layers, further enhancing the model’s capability to process spatiotemporal data. Θ, Λ, Ξ, Φ, Ψ are sets of network parameters, covering weights and biases across all layers of the hybrid neural network architecture.
2.4.3 Fine-grained action recognition.
Next, we explore fine-grained action recognition. Fine-Grained Action Recognition:
(19)
(20)
(21)
(22)
where ωk, Zkl, fk, μk, Nkl, gk, νk, Okl, hk, ξk, Qkl, ik are weights and biases of each layer. Here, Sigmoid, Tanh, and ReLU are activation functions, and LeakyReLU is an improved ReLU function, used to increase the network’s non-linearity and prevent the problem of gradient vanishing.
These equations collectively form the mathematical model for fine-grained action recognition. Ft represents the feature extraction of the first layer, using the Sigmoid function to process each neuron’s output. Bt is the second layer, using the Tanh activation function to extract more complex features. At is the third layer, employing the ReLU function to add non-linearity to the model. Finally, Dt uses the LeakyReLU function to further enhance the model’s expressive power. Each layer uses different activation functions and weighted sums to extract and merge features at different levels, thereby achieving precise recognition of subtle movements.
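The four-layer stack described above can be sketched directly; layer sizes and weights below are illustrative placeholders, not the trained parameters of Eqs (19)–(22).

```python
import numpy as np

rng = np.random.default_rng(7)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def fine_grained(x, params):
    # Four stacked layers mirroring Eqs (19)-(22): Sigmoid -> Tanh -> ReLU
    # -> LeakyReLU, each a weighted sum followed by its activation.
    (W1, f1), (W2, g1), (W3, h1), (W4, i1) = params
    F_t = sigmoid(W1 @ x + f1)            # Eq (19): first feature layer
    B_t = np.tanh(W2 @ F_t + g1)          # Eq (20): complex features
    A_t = np.maximum(0.0, W3 @ B_t + h1)  # Eq (21): ReLU non-linearity
    D_t = leaky_relu(W4 @ A_t + i1)       # Eq (22): final output
    return D_t

dims = [16, 12, 10, 8, 4]                 # toy layer widths
params = [(rng.normal(size=(dims[k + 1], dims[k])) * 0.2,
           np.zeros(dims[k + 1])) for k in range(4)]
D_t = fine_grained(rng.normal(size=16), params)
print(D_t.shape)
```

LeakyReLU in the last layer keeps a small gradient for negative pre-activations, which is the stated motivation for using it against gradient vanishing.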
Lemma 2 (Complexity of Feature Extraction). Let Ft denote a deep learning model used for feature extraction from video data. If Ft is sufficiently complex, it can more effectively extract fatigue-related features from video data. This can be described by the following mathematical expression:
(23) where ϕi, Aij, fi are weights and biases of the additional feature extraction layer, and ψi, Bij, gi are weights and biases of the fusion layer. ReLU and Sigmoid are activation functions. This complex neural network structure can effectively extract complex features from time-series data.
Proof is provided in the S1 Appendix.
Based on the mathematical model of fine-grained feature extraction, we define Problem 5:
Problem 5. The goal of fine-grained action recognition is to distinguish subtle fatigue-related changes while capturing facial actions. The optimization problem can be represented as:
(24)
(25)
(26) where Dt is the fatigue state prediction output based on fine-grained feature extraction, Ŷt is the actual fatigue state at time t, and St is the feature vector based on dynamic keyframe extraction. γ, δ are weighting factors used to balance the optimization of fine-grained feature extraction and action recognition.
In this problem, St represents the optimization model for dynamic keyframe extraction, while Ft and Ct represent the complex mathematical models of the additional feature extraction and fusion layers, enhancing the model’s capability to process spatiotemporal data. Ω, Φ, Ψ are sets of network parameters, covering weights and biases across all layers of the hybrid neural network architecture.
Problem 5 focuses on fine-grained action recognition, aiming to precisely capture subtle facial changes of drivers, particularly those minor but critical signs of fatigue. This fine-grained recognition is crucial for improving the accuracy of fatigue driving detection. Our challenge is to adjust the network to sensitively respond to subtle facial movements, such as minor eye movements or brief gaze shifts. This not only requires complex mathematical models and optimization strategies but also a profound understanding of human behavior characteristics for effective and accurate detection of fatigue driving.
Corollary 2 (Advantage of Composite Action Recognition). Let 3DConv represent a 3D convolutional network, LSTM a Long Short-Term Memory network, and Rt the Composite Action Recognition Network. In Rt, the 3D convolutional network 3DConv is responsible for capturing the spatial features of facial actions, while the Long Short-Term Memory network LSTM handles time-series data, extracting the temporal characteristics of movements. The Composite Action Recognition Network Rt combines the advantages of both, resulting in superior performance in recognizing fatigue signs in continuous video frames compared to using either network alone. This combination can be mathematically expressed as:
(27) Rt = σ(α · 3DConv(St; Θ) + β · LSTM(St; Λ))
where Θ and Λ are the parameter sets of the 3D convolutional and Long Short-Term Memory networks, respectively, and α and β are coefficients for adjusting the importance of the outputs from both networks. This combination enables Rt to capture instantaneous facial actions while also focusing on the evolution of movements over time, thereby enhancing the accuracy of fatigue sign recognition.
Proof is provided in the S1 Appendix.
3 Algorithm pseudocode
3.1 Pseudocode and explanation for Algorithm 1
Algorithm 1 combines the concepts of Problem 1, Problem 2, and Problem 3.
Algorithm 1: Comprehensive Fatigue Detection Algorithm
Data: Facial behavior video sequence
Result: Preliminary fatigue detection feature vector set
1 begin
// Content from Problem 1
2 Initialize the dynamic keyframe extraction model St, see Eqs (8) and (9)
// Content from Problem 2
3 Initialize the composite action recognition network Ht, Gt, Rt, see Eqs (13), (14) and (15)
// Content from Problem 3
4 Initialize additional feature extraction and fusion layers Ft, Ct, see Eqs (16) and (17)
5 for each time window do
6 foreach video frame Vt do
7 Compute keyframe difference measure Dt, see Eq (8)
8 if keyframe extraction conditions are met then
9 Extract keyframe set St based on the optimization model of Eq (9)
10 Perform composite action recognition on each keyframe
11 foreach keyframe St do
12 Apply 3D convolution and LSTM networks, compute Ht, see Eq (13)
13 Integrate feature extraction results, compute Gt, see Eq (14)
14 Generate preliminary fatigue detection result, compute Rt, see Eq (15)
15 Update additional feature extraction results, compute Ft, see Eq (16)
16 Update fusion layer results, compute Ct, see Eq (17)
17 end
18 end
19 end
20 end
21 return Preliminary fatigue detection feature vector set
22 end
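The control flow of Algorithm 1 can be sketched as a plain-Python skeleton; the `diff` and `recognize` callables below are toy stand-ins injected in place of the difference measure of Eq (8) and the composite recognition network of Eqs (13)–(15).

```python
def algorithm1(video, window, delta, diff, recognize):
    # Skeleton of Algorithm 1: slide over the sequence, test the keyframe
    # condition with a difference measure, and run composite recognition
    # on each selected frame, collecting preliminary feature vectors.
    features = []
    for t in range(window, len(video)):
        if diff(video[t - window:t + 1]) > delta:   # keyframe condition
            features.append(recognize(video[t]))    # composite recognition
    return features

toy_video = [0, 0, 0, 5, 0, 0, 0, 7, 0, 0]          # toy 1-D "frames"
feats = algorithm1(
    toy_video,
    window=2,
    delta=1.0,
    diff=lambda w: abs(w[-1] - w[-2]),    # crude difference measure
    recognize=lambda frame: frame * 0.1,  # crude per-frame score
)
print(feats)
```

Only frames adjacent to abrupt changes pass the keyframe test, so downstream recognition runs on a small fraction of the sequence.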
3.2 Pseudocode and explanation for Algorithm 2
Algorithm 2 combines the concepts of Problem 4 and Problem 5, as well as the output of Algorithm 1.
Algorithm 2: Fine-Grained Fatigue Detection Algorithm
Data: Preliminary fatigue detection feature vector set from Algorithm 1
Result: Final fatigue detection result
1 begin
// Content from Problem 4
2 Initialize the fine-grained action recognition models, see Eqs (25) and (26)
3 foreach feature vector Ft from Algorithm 1 do
// Content from Problem 5
4 Apply fine-grained action recognition model, compute Dt, see Eq (22)
5 for each time step t do
6 Combine fine-grained features, update fatigue state prediction, see Eq (25)
7 Compute final fatigue detection result, see Eq (26)
8 end
9 end
10 return Final fatigue detection result set
11 end
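Algorithm 2's structure can likewise be sketched; the `fine_grained` and `fuse` callables are toy stand-ins for the fine-grained recognition model of Eq (22) and the fusion of per-step predictions into a final decision.

```python
def algorithm2(feature_vectors, fine_grained, fuse):
    # Skeleton of Algorithm 2: apply the fine-grained model to each
    # preliminary feature vector from Algorithm 1, then fuse the
    # per-step predictions into the final fatigue detection result.
    predictions = [fine_grained(f) for f in feature_vectors]
    return fuse(predictions)

preliminary = [0.5, 0.0, 0.7, 0.0]  # e.g., output of Algorithm 1
final = algorithm2(
    preliminary,
    fine_grained=lambda f: 1.0 if f > 0.4 else 0.0,  # crude thresholding
    fuse=lambda p: sum(p) / len(p),                  # crude averaging
)
print(final)
```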
3.3 Time and space complexity of algorithms
3.3.1 Algorithm 1.
The time complexity of Algorithm 1 is mainly influenced by the complexity of keyframe extraction and the composite action recognition network. Assuming the length of the video sequence is N, the time complexity is O(N × K × M), where K is the number of keyframes in each time window, and M is the number of layers in the composite action recognition network. The space complexity is mainly determined by the storage of network parameters and intermediate feature vectors, estimated as O(K × M).
3.3.2 Algorithm 2.
The time complexity of Algorithm 2 is influenced by the fine-grained action recognition model. Assuming the size of the feature vector set output by Algorithm 1 is P, the time complexity is O(P × L), where L is the number of layers in the fine-grained action recognition model. The space complexity is mainly determined by the storage of model parameters and intermediate computation results, estimated as O(L).
4 Experiments
The proposed model is implemented using the PyTorch deep learning framework. All experiments are conducted on a workstation with a 2.10GHz Intel(R) Xeon(R) Silver 4116 CPU, 16GB RAM, an NVIDIA Tesla V100 GPU, and Ubuntu 16.04.
To evaluate the performance of the fatigue detection system, a series of tests were conducted under various experimental conditions. Experimental parameters include the characteristics of the dataset, model training parameters, keyframe extraction settings, etc. Table 2 details the experimental parameter settings. Data augmentation techniques, including rotation and flipping, significantly improve our model’s generalization capability. By introducing variations in facial orientation and posture, these techniques ensure robustness against the diverse manifestations of driver fatigue. This diversity in the training data helps prevent overfitting, enabling the model to recognize fatigue-related features across different individuals and scenarios effectively, thus enhancing detection efficiency.
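As a framework-free illustration of such augmentation, the sketch below applies a random horizontal flip and a right-angle rotation to an image array; in practice, small-angle rotations (e.g. torchvision's `RandomRotation`) would be more typical for faces, so treat the specifics here as assumptions:

```python
import numpy as np

def augment(image, rng):
    """Illustrative augmentation: random horizontal flip plus a random
    0/90/180/270-degree rotation of an H x W (x C) image array."""
    if rng.random() < 0.5:
        image = image[:, ::-1]      # mirror left/right facial asymmetry
    k = int(rng.integers(0, 4))     # number of 90-degree rotations
    return np.rot90(image, k)       # vary apparent head orientation
```

Applying a fresh random transform each epoch exposes the model to varied facial orientations without collecting new data, which is what drives the generalization gain described above.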
4.1 Datasets
4.1.1 Yawn dataset.
The Yawn Dataset, available on Kaggle’s official website, contains two categories: Yawn and no-Yawn. The Yawn category includes 2528 JPG format images, while the no-Yawn category includes 2591 JPG format images. Each type features “mouth” characteristics from different races, genders, and ages. The data is split into a training set and a test set in a 4:1 ratio to train and validate the two algorithms proposed in this paper.
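The 4:1 split could be realized with a seeded shuffle; the ratio matches the paper, while the seed and helper name are illustrative:

```python
import random

def split_dataset(paths, train_ratio=0.8, seed=42):
    """Shuffle sample identifiers and split them 4:1 into train and
    test lists, mirroring the 80/20 split used for the Yawn dataset."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)  # fixed seed keeps the split reproducible
    cut = int(len(paths) * train_ratio)
    return paths[:cut], paths[cut:]
```

For the 2528 Yawn plus 2591 no-Yawn images, this yields 4095 training and 1024 test samples.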
4.1.2 YawDDR dataset.
The YawDDR dataset is an extension of the standard YawDD dataset, which is a publicly available dataset for yawning detection. The YawDDR dataset serves as a benchmark for evaluating face detection, feature extraction, and yawning detection algorithms. This dataset is composed of 351 video clips featuring a variety of volunteers who differ in gender, age, nationality, and ethnicity. Captured within the confines of stationary vehicles under daylight, these videos exhibit subtle differences in lighting conditions. The dataset records each participant in three to four separate videos, showcasing a range of oral movements including talking, yawning, and a combination of both. Given the variety of facial expressions and the need for clarity in data analysis, these videos, often longer than a minute and containing numerous facial movements, have been divided into smaller segments, thus forming the comprehensive YawDDR dataset.
The video length in the YawDDR dataset is approximately 8 seconds, covering three distinct actions: talking, yawning, and yawning while talking. Sample images from the dataset before face segmentation are shown in Fig 1. Fig 2 presents 486 image sequences from the YawDDR dataset. The YawDDR dataset is used to validate the efficacy of the proposed method.
(N: Normal no yawning; Y: Yawning; YT: Yawning & Talking).
4.2 Experimental design and results
4.2.1 Model performance testing.
To test the two algorithms proposed in this paper, the YawDDR dataset was first processed using dynamic keyframe extraction. This method reduced the original 8-second videos of over 100 frames to data groups of about 30 frames each, ensuring no duplicate images in each group. Subsequently, facial information was extracted from each keyframe. Figs 3–5 show the keyframe cropping results for the YawDDR dataset under the Normal, Talking, and Yawning states, respectively.
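The dynamic keyframe extraction step can be sketched with a simple frame-differencing heuristic; the threshold and greedy selection rule below are illustrative stand-ins for the keyframe difference measure Dt of Eq (8) and the optimization model of Algorithm 1:

```python
import numpy as np

def extract_keyframes(frames, threshold=12.0, max_keep=30):
    """Greedy keyframe selection: keep a frame when its mean absolute
    pixel difference from the last kept frame exceeds `threshold`.
    Reduces a ~100-frame clip to at most `max_keep` distinct frames."""
    keyframes = [frames[0]]
    for frame in frames[1:]:
        diff = np.mean(np.abs(frame.astype(float) - keyframes[-1].astype(float)))
        if diff > threshold:          # frame differs enough to be informative
            keyframes.append(frame)
        if len(keyframes) >= max_keep:
            break
    return keyframes
```

Because near-duplicate frames never exceed the threshold, each resulting group contains no repeated images, matching the property described above.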
4.2.2 Impact of brightness on the two Algorithms.
To validate the point proposed in this paper, the brightness of each keyframe in the YawDDR dataset was processed. Figs 6–8 show examples of the YawDDR dataset after brightness processing. (a) is an image with 100% brightness, (b) is an image with 70% brightness, and (c) is an image with 130% brightness.
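The 70% and 130% brightness variants can be produced by scaling pixel intensities and clipping to the valid 8-bit range; this numpy sketch is an assumption about the preprocessing, not the paper's exact implementation:

```python
import numpy as np

def adjust_brightness(image, factor):
    """Scale pixel intensities by `factor` (0.7 = 70%, 1.3 = 130%),
    rounding and clipping to the valid 8-bit range [0, 255]."""
    scaled = image.astype(np.float32) * factor
    return np.clip(np.rint(scaled), 0, 255).astype(np.uint8)
```

For example, `adjust_brightness(frame, 0.7)` and `adjust_brightness(frame, 1.3)` would generate the (b) and (c) variants from the original 100%-brightness keyframe (a).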
Models were trained and tested on the Yawn and YawDDR datasets, respectively. Fig 9 shows the detection accuracy of Algorithms 1 and 2 on the test sets of the Yawn and YawDDR datasets, while Fig 10 shows the corresponding confusion matrices. The experimental results indicate that Algorithm 2, proposed in this paper, achieves higher detection precision than Algorithm 1, demonstrating that the improvements made in Algorithm 2 have a positive effect. Both algorithms performed better on the Yawn dataset than on the YawDDR dataset. This is attributed to the Yawn dataset containing only mouth features, which carry less extraneous information and make classification easier. Table 3 presents a comparison of accuracy between this paper and other studies.
We then analyzed the precision, recall, and F1 score of the experimental results, as shown in Fig 11. On both datasets, Algorithm 2 outperforms Algorithm 1 on all three metrics. Its precision peaks especially on the Yawn dataset, demonstrating an increased ability to correctly identify positive instances. The recall metric, which measures the ability to find all actual positive instances, likewise favors Algorithm 2: it captures more relevant instances and loses fewer true positives than Algorithm 1. The F1 score, which balances precision and recall, is also markedly higher for Algorithm 2 on both datasets, indicating a more robust balance between correctness and completeness in the detection task. In short, Algorithm 2 is superior regardless of which metric is used.
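The three metrics reported in Fig 11 follow directly from binary confusion-matrix counts such as those in Fig 10; a minimal helper, with purely illustrative counts in the test below:

```python
def metrics_from_confusion(tp, fp, fn):
    """Precision, recall, and F1 from binary confusion-matrix counts
    (true positives, false positives, false negatives)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Because F1 is the harmonic mean of precision and recall, an algorithm can only score well on it by doing well on both, which is why it serves as the balanced summary metric in the comparison above.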
4.2.3 Impact of brightness changes on Algorithm 2.
Table 4 presents the detection results of the two models proposed in this paper on the YawDDR dataset. The data indicate that image brightness affects the models' detection results: lower interior brightness leads to lower detection accuracy, whereas with sufficient interior light, detection accuracy improves by 0.01%–0.05%.
(Ave: Average; Y: Yawning; YT: Yawn & Talking; T: Talking).
To further observe the impact of interior lighting changes on the accuracy of the model, three levels of brightness were set: 70% as low brightness, 100% as medium brightness, and 130% as high brightness. To simulate changes in vehicle lighting, the keyframes were further processed to set varying levels of brightness, namely strong-medium-weak (HML), strong-weak-medium (HLM), medium-strong-weak (MHL), medium-weak-strong (MLH), weak-strong-medium (LHM), and weak-medium-strong (LMH). Figs 6–8 illustrate the keyframe changes in the YawDDR dataset with weak-medium-strong brightness.
Using the above method, datasets were processed as YawDDR_HML, YawDDR_HLM, YawDDR_MHL, YawDDR_MLH, YawDDR_LHM, and YawDDR_LMH. Algorithm 2 was trained using the same method, and the model detection results are shown in Fig 12. The graph indicates that the overall accuracy of the algorithm decreases when brightness variation is introduced. When the lighting change pattern is HML, the model experiences less interference and achieves higher accuracy. Inferring from the results of the second experiment, the strong light at the beginning aids model detection, while the weak light in the latter half has only a minor impact. Consequently, the HML and HLM brightness variations yield higher detection accuracy than MHL and MLH, which in turn are more accurate than LHM and LMH.
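The six lighting-change patterns are exactly the permutations of the three brightness levels (H = 130%, M = 100%, L = 70%); a small sketch for enumerating the YawDDR_* variants, with the factor values assumed from the levels defined above:

```python
from itertools import permutations

# Brightness factors: H = high (130%), M = medium (100%), L = low (70%)
LEVELS = {"H": 1.3, "M": 1.0, "L": 0.7}

def brightness_schedules():
    """Return the six orderings (HML, HLM, MHL, MLH, LHM, LMH) as
    (name, factor-sequence) pairs, one per YawDDR_* dataset variant."""
    return [("".join(p), [LEVELS[c] for c in p])
            for p in permutations("HML")]
```

Each factor sequence would then be applied segment by segment to a clip's keyframes to simulate lighting that changes as the vehicle moves.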
4.3 Discussion
- The generalization of our model to real-time datasets presents a significant challenge, primarily due to the inherent variability in such environments, including fluctuating lighting conditions, diverse driver behaviors, and unpredictable external factors. While our current study demonstrates promising results on the Yawn and YawDDR datasets, real-world application scenarios might introduce complexities not fully captured by these datasets. The limitations of our study in handling rapidly changing lighting conditions in a real-car environment highlight the need for further research. Future work could explore the integration of adaptive algorithms capable of dynamically adjusting to varying environmental conditions to improve real-time applicability.
- Assessing driving fatigue under “challenge” conditions goes beyond lighting variations. Factors such as the driver’s face distance from the camera, orientation, and possible occlusions (e.g., sunglasses or other facial wear) can significantly affect the classification accuracy. Our current methodology does not explicitly account for these variables, which may impact the model’s performance in real-world scenarios. Acknowledging these limitations is crucial for guiding future enhancements of our fatigue detection system. Efforts to include more diverse and challenging conditions in our training and validation datasets will be vital for improving the robustness and reliability of the model.
- Looking ahead, potential developments for our work include exploring multi-modal data integration, such as combining visual cues with physiological signals (e.g., heart rate or skin conductance) for a more comprehensive assessment of driver fatigue. Additionally, implementing our proposed method in real-car testing under various operational conditions will be essential to evaluate its effectiveness in real-world scenarios. This practical assessment will help identify specific areas for improvement, particularly in dealing with rapidly changing lighting conditions and other environmental factors not fully simulated in controlled datasets.
5 Conclusion
This study proposes an innovative method for driver fatigue detection, aiming to enhance the safety and reliability of intelligent driving assistance systems. By combining a CNN model with emotional state analysis, our approach demonstrates outstanding robustness in complex environments with varying illumination. The experimental results show that our method significantly improves both accuracy and computational efficiency compared to existing technologies, particularly in environments with drastic changes in lighting conditions. Furthermore, this study explores the impact of different lighting conditions on fatigue detection accuracy, finding that changes in brightness significantly affect model performance. Especially in experiments simulating interior lighting changes, the results indicate that different combinations of light intensity have varying effects on model accuracy, providing crucial insights for optimizing fatigue detection models in such environments.
Overall, this study not only proposes an effective method for driver fatigue detection but also provides robust evidence for the adaptability and reliability of intelligent driving assistance systems in complex environments. We anticipate that this approach can be further optimized and applied to make greater contributions to road safety and driver welfare.
References
- 1. Azam K, Shakoor A, Shah RA, Khan A, Shah SA, Khalil MS. Comparison of fatigue related road traffic crashes on the national highways and motorways in Pakistan. Journal of Engineering and Applied Sciences. 2014; 33(2).
- 2. Liu H, Zhang T, Xie H, Chen H, Li F. Real-Time Driver Fatigue Detection Based on ELM. Proceedings of ELM-2015 Volume 2: Theory, Algorithms and Applications (II). 2016; Springer. 423–435.
- 3. Richardson JH. The development of a driver alertness monitoring system. Fatigue and Driving. 2019; Routledge. 219–229.
- 4. Plămădeală V. Driving tiredness–the end enemy of the driver. Journal of Engineering Sciences. 2022; 3:9–22.
- 5. Davidović J, Pešić D, Lipovac K, Antić B. The significance of the development of road safety performance indicators related to driver fatigue. Transportation research procedia. 2020; Elsevier. 45:333–342.
- 6. Kim JH, Poulose A, Han DS. The extensive usage of the facial image threshing machine for facial emotion recognition performance. Sensors. 2021; MDPI. 21(6):2026. pmid:33809352
- 7. Poulose A, Reddy CS, Kim JH, Han DS. Foreground Extraction Based Facial Emotion Recognition Using Deep Learning Xception Model. 2021 Twelfth International Conference on Ubiquitous and Future Networks (ICUFN). 2021; IEEE. 356–360.
- 8. Zhang Y, Hua C. Driver fatigue recognition based on facial expression analysis using local binary patterns. Optik. 2015; Elsevier. 126(23):4501–4505.
- 9. Tao H, Zhang G, Zhao Y, Zhou Y. Real-time driver fatigue detection based on face alignment. Ninth International Conference on Digital Image Processing (ICDIP 2017). 2017; SPIE. 10420:6–11.
- 10. Jia H, Xiao Z, Ji P. Fatigue driving detection based on deep learning and multi-index fusion. IEEE Access. 2021; IEEE. 9:147054–147062.
- 11. Sacco M, Farrugia RA. Driver fatigue monitoring system using support vector machines. 2012 5th International Symposium on Communications, Control and Signal Processing. 2012; IEEE. 1–5.
- 12. Khan I, Abdullah H, Zainal MS, Anuar S, Hazwaj M, Mohamad M. Vision based composite approach for lethargy detection. 2014 IEEE 10th International Colloquium on Signal Processing and its Applications. 2014; IEEE. 82–86.
- 13. Qunzhu T, Zhang R, Yan Y, Zhang C, Li Z. Improvement of random forest cascade regression algorithm and its application in fatigue detection. 2019 IEEE 2nd International Conference on Electronics Technology (ICET). 2019; IEEE. 499–503.
- 14. You F, Gong Y, Tu H, Liang J, Wang H. A fatigue driving detection algorithm based on facial motion information entropy. Journal of advanced transportation. 2020; Hindawi Limited. 2020:1–17.
- 15. Dong BT, Lin HY. An on-board monitoring system for driving fatigue and distraction detection. 2021 22nd IEEE International Conference on Industrial Technology (ICIT). 2021; IEEE. 850–855.
- 16. Dong BT, Lin HY, Chang CC. Driver fatigue and distracted driving detection using random forest and convolutional neural network. Applied Sciences. 2022; MDPI. 12(17):8674.
- 17. Min J, Wang P, Hu J. Driver fatigue detection through multiple entropy fusion analysis in an EEG-based system. PLoS one. 2017; Public Library of Science San Francisco, CA USA. 12(12):e0188756. pmid:29220351
- 18. Han W, Zhao J, Chang Y. Driver behaviour and traffic accident involvement among professional heavy semi-trailer truck drivers in China. PLoS one. 2021; Public Library of Science San Francisco, CA USA. 16(12):e0260217. pmid:34855802
- 19. Moessinger M, Stürmer R, Mühlensiep M. Auditive beta stimulation as a countermeasure against driver fatigue. Plos one. 2021; Public Library of Science San Francisco, CA USA. 16(1):e0245251. pmid:33428673
- 20. Poulose A, Kim JH, Han DS. Feature vector extraction technique for facial emotion recognition using facial landmarks. 2021 International Conference on Information and Communication Technology Convergence (ICTC). 2021; IEEE. 1072–1076.
- 21. Savchenko AV, Savchenko LV, Makarov I. Classifying emotions and engagement in online learning based on a single facial expression recognition neural network. IEEE Transactions on Affective Computing. 2022; IEEE. 13(4):2132–2143.
- 22. Ding E, Xu D, Zhao Y, Liu Z, Liu Y. Attention-based 3D convolutional networks. Journal of Experimental & Theoretical Artificial Intelligence. 2023; Taylor & Francis. 35(1):93–108.
- 23. Wang Y, He Z, Wang L. Truck Driver Fatigue Detection Based on Video Sequences in Open-Pit Mines. Mathematics. 2021; MDPI. 9(22):2908.
- 24. Kielty P, Dilmaghani MS, Ryan C, Lemley J, Corcoran P. Neuromorphic sensing for yawn detection in driver drowsiness. Fifteenth International Conference on Machine Vision (ICMV 2022). 2023; SPIE. 12701:287–294.
- 25. Yang H, Liu L, Min W, Yang X, Xiong X. Driver yawning detection based on subtle facial action recognition. IEEE Transactions on Multimedia. 2020; IEEE. 23:572–583.
- 26. Majeed F, Shafique U, Safran M, Alfarhood S, Ashraf I. Detection of drowsiness among drivers using novel deep convolutional neural network model. Sensors. 2023; MDPI. 23(21):8741. pmid:37960441
- 27. Mzoughi H, Njeh I, Wali A, Slima MB, BenHamida A, Mhiri C, et al. Deep multi-scale 3D convolutional neural network (CNN) for MRI gliomas brain tumor classification. Journal of Digital Imaging. 2020; Springer. 33:903–915. pmid:32440926
- 28. Kalfaoglu ME, Kalkan S, Alatan AA. Late temporal modeling in 3D CNN architectures with BERT for action recognition. Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16. 2020; Springer. 731–747.
- 29. Zhao S, Tao H, Zhang Y, Xu T, Zhang K, Hao Z, et al. A two-stage 3D CNN based learning method for spontaneous micro-expression recognition. Neurocomputing. 2021; Elsevier. 448:276–289.