Monitoring social distancing under various low-light conditions with deep learning and a single motionless time-of-flight camera

The purpose of this work is to provide an effective social distance monitoring solution in low-light environments during a pandemic. The raging coronavirus disease 2019 (COVID-19), caused by the SARS-CoV-2 virus, has brought a global crisis with its deadly spread all over the world. In the absence of an effective treatment and vaccine, efforts to control this pandemic rely strictly on personal preventive actions, e.g., handwashing, face mask usage, and environmental cleaning, and most importantly on social distancing, which is the only expedient approach to cope with this situation. Low-light environments can become a problem for the spread of the disease because of people's night gatherings, especially in summer when temperatures are at their peak. In cities where homes are congested and no proper cross-ventilation is available, people find ways to get out of their homes with their families at night for fresh air. In such a situation, it is necessary to take effective measures to monitor the safety-distance criteria, avoid more positive cases, and control the death toll. In this paper, a deep learning-based solution is proposed for the above-stated problem. The proposed framework utilizes the you only look once v4 (YOLO v4) model for real-time object detection, and a social distance measuring approach is introduced with a single motionless time-of-flight (ToF) camera. The risk factor is indicated based on the calculated distance, and safety-distance violations are highlighted. Experimental results show that the proposed model exhibits good performance, with a 97.84% mean average precision (mAP) score, and the observed mean absolute error (MAE) between actual and measured social distance values is 1.01 cm.


Introduction
Coronavirus disease 2019 (COVID-19) belongs to the family of coronavirus-caused diseases and was first reported in Wuhan, China at the end of December 2019. China announced its first death from the virus, a 61-year-old man, on January 11. On March 11, the World Health Organization (WHO) [1,2] declared it a pandemic due to its spread over 114 countries, with a death toll of 4,000 and 118,000 active cases [3]. Data from Johns Hopkins University showed that more than seven million people were confirmed to have the coronavirus, with at least 406,900 dying from the disease, as of June 8. Several health organizations, scientists, and doctors tried to develop a vaccine, but no success has been observed so far. This situation forced the world to find an alternative solution to avoid drastic results. Lockdowns were imposed globally, and maintaining a safe social distance was reported to be the alternative way to cope with this drastic situation. Social distancing is the central idea in regulating the efforts made to minimize the spread of COVID-19 [4]. The basic objective is to reduce physical contact between infected and healthy people. As prescribed by the WHO, people should maintain at least 1 meter (m) of distance from each other to control the spread of this disease [1,5,6].
This paper aims, first, to mitigate the effects of coronavirus disease with minimal loss of resources, since this disease has badly impacted the global economy; and second, to provide a highly accurate people-detection solution to assist in monitoring social distancing during the night. Especially in summer, when the heat is at its peak, people living in congested homes find ways to get out with their families at night for fresh air, and in this serious situation it is necessary to take proper action. Recently, Eksin et al. [7] evaluated a susceptible-infected-recovered (SIR) model that includes a social distancing term. They showed that the spread of disease depends on people's social behavior. They assessed the results of the SIR model with and without the behavior-change factor and found that a simple SIR model did not perform well even after many repeated observations, whereas their updated SIR model with the behavior-change factor showed good results and corrected the initial error rate. In a similar context, the company Landing AI [8] has announced the development of an AI tool for monitoring social distance in work areas. In a short report [8], the firm stated that the prospective tool will be able to observe whether people are following the safety-distance criteria by examining real-time video streams captured by a camera. They affirmed that this tool can easily be combined with the security cameras already available in different work areas to ensure a safe distance between workers. The world-leading research company Gartner Inc. [9] declared Landing AI a cool vendor in AI core technologies to acknowledge its timely initiative to support the fight against the deadly situation of COVID-19 [10].
In this article, a deep learning-based solution is proposed for the automatic detection of people and monitoring of social distance in low-light environments. The first contribution of this article is the performance evaluation of YOLO v4 in low-light conditions without applying any image cleansing approaches. Low-light environments have received little attention in the past, and the few studies that address them do so only in the context of enhancing low-light scenes and improving visibility [11][12][13][14]. For real-time object detection and monitoring this approach is not feasible, because enhancing the low-light scene first and then applying object detection takes extra time, whereas a real-time application must give a timely response with high accuracy. Secondly, a social distance monitoring solution is proposed that considers a precise speed-accuracy tradeoff and is evaluated on our custom dataset. Experimental results show that the model exhibits good performance, with a balanced mAP score and an MAE [15] of 1.01 cm.

Related work
In this section, we briefly introduce previous work on social distancing in the context of the 2019 novel coronavirus disease. As the disease spread at the end of December, researchers started working to contribute in this deadly situation, and social distancing was suggested as the alternative solution. Different research studies were conducted to provide an effective social distancing solution. In this context, Prem et al. [16] studied the consequences of social distancing measures on the progression of the COVID-19 epidemic in Wuhan, China. They used synthetic location-specific contact patterns to imitate an ongoing outbreak trajectory using age-structured susceptible-exposed-infected-removed (SEIR) models under several social distancing measures. They concluded that a sudden lifting of interventions would lead to an earlier secondary peak, which could be flattened by relaxing the interventions gradually. Social distancing is clearly important for coping with the current situation, but economically it is a drastic measure for flattening the curve of an infectious disease. Adolph et al. [17] examined the situation in the USA, gathering state-level responses regarding social distancing, and found contradictions in the decisions of policymakers and politicians that delayed the imposition of social distancing strategies, resulting in ongoing harm to public health. On the brighter side, social distancing helped a lot to control the spread of disease, but it has also affected economic productivity. In the same context, Kylie et al. [18] studied the association between transmissibility and social distancing within different provinces of China and found that the association weakens as transmissibility decreases. According to the study, an intermediate level of activity could be allowed while still avoiding an immense outbreak.
Since the COVID-19 pandemic began, many countries have been seeking technology-oriented solutions, and Asian countries have used a range of technologies to fight COVID-19. The most widely used is phone-based location tracking, where the data of COVID-19-positive people are saved and nearby healthy people are monitored based on it. Germany and Italy are using anonymized location data to monitor lockdowns. The UK has launched an application (app) named C9 corona symptom tracker [19] that helps people report their symptoms. Similarly, South Korea launched an app named Corona 100m [19], which stores the locations of infected people and alerts healthy users when they come within 100 m of a corona patient. India has developed an app that helps people maintain a specific distance from a person who has tested corona-positive. Besides this, India, South Korea, and Singapore are using CCTV footage [19] to trace the recently visited places of COVID-19 patients and track down infected people. China is utilizing AI-powered thermal cameras [19] to identify people in a crowd with elevated temperatures. Such inventions in this drastic situation might help flatten the curve, but at the same time they pose a threat to personal information.
Object detection has helped a lot in this deadly situation: many researchers have investigated it [20][21][22][23] to detect various types of objects that can help in this scenario. Human detection [24][25][26][27] is an established area of research, and recent advancements in this field [28,29] have created demand for intelligent systems that monitor unusual human activities. Human detection nevertheless remains challenging for many reasons, such as low-quality video, diverse articulated poses, background complexity, and limited machine learning capability; hence, existing knowledge can be used to boost detection performance [20]. Narinder et al. [21], motivated by the notion of social distancing, proposed a deep learning-based framework to automate the task of observing social distance using surveillance video [22]. They used the YOLO v3 [30] algorithm with the Deep SORT technique to separate people from the background and track the detected people with the help of bounding boxes. Cob et al. [23] investigated the relation between COVID-19 growth rates in the US and shelter-in-place (SIP) orders. They presented a random forest machine learning model for their predictions and found the SIP orders very effective; their study showed that SIP orders will be helpful not only for the US but also for highly populated countries seeking to reduce the COVID-19 growth rate. Deep learning is a popular approach to object detection that has gained huge interest in modern research. Deep learning techniques have been successfully applied in the drastic situation of COVID-19 by automating face mask detection [31], detecting COVID-19 cases in X-ray images [32], measuring lung infection in CT images [33], monitoring COVID-19 patients [34], and, most importantly, monitoring social distancing [20][21][22][23].
Different research studies were conducted to provide better and more effective social distance monitoring solutions, as discussed above, but none has focused on low-light environments; besides this, we have not found any real-world unit distance mapping solution. To fill this research gap, this article focuses on low-light conditions and proposes a real-world unit distance mapping strategy that simplifies the social distance monitoring task, to help out in this deadly situation. In region-based algorithms such as R-CNN, Fast R-CNN, and Faster R-CNN, a classifier is run on region proposals that are treated as bounding boxes. These algorithms exhibit good detection performance, especially Faster R-CNN with an accuracy of 73.2% mAP, but because of their intricate pipeline they run at only 7 frames per second (FPS), which limits them for real-time object detection. This is where YOLO fits: a real-time object detection system with the creative perspective of treating object detection as a regression problem, introduced in 2016 by Joseph et al. [39]. YOLO performs well compared to previous region-based algorithms in terms of speed, at 45 FPS, while maintaining a good detection accuracy of 63.4% mAP. Despite its good speed and performance, YOLO made notable localization errors and had low recall. To resolve these shortcomings, in the same year the authors of YOLO released a second version that focused mainly on recall and localization without affecting classification accuracy. YOLO v2 [40] reached a speed of 67 FPS and an mAP of 76.8%. YOLO v2 is also called YOLO9000 because of its ability to detect objects of more than 9,000 classes by jointly optimizing classification and detection. YOLO v3 [30], developed in 2018, brought new improvements in speed and accuracy, but the main idea remained the same.

Background of deep learning models
YOLO v4 was recently released by Bochkovskiy et al. [41]. In comparison with its direct predecessor YOLO v3, average precision (AP) and FPS increased by about 10% and 12%, respectively. In experiments on the MS COCO [42] dataset, it obtained a 43.5% AP score and achieved a real-time speed of approximately 65 FPS on a Tesla V100, outperforming the most accurate and fastest detectors in terms of both accuracy and speed. Most detectors require multiple GPUs for training with a large batch size, whereas training on a single GPU makes the training process very slow. YOLO v4 resolved this issue by presenting a fast and accurate object detector that can be trained with a smaller batch size on a single GPU. Below we briefly describe the architecture of general object detectors and the newly introduced YOLO v4 model.

General architecture of object detector
Ordinary object detectors like R-CNN, Fast R-CNN, and Faster R-CNN are two-stage detectors made up of three parts: backbone, neck, and head.
• Backbone: A convolutional network, usually pretrained on a large image classification dataset, that extracts feature maps from the input image.
• Neck: Extra layers between the backbone and the head that help with feature map extraction from the earlier backbone stages. Different feature map extraction techniques are used; e.g., YOLO v3 uses a Feature Pyramid Network (FPN) [46] to extract feature maps of different scales from the backbone, where each subsequent layer takes as input the merged results of previous layers and produces a different level of the pyramid. Classification/regression (the head) is applied at every pyramid level, which helps in detecting objects of different sizes.
• Head: This is responsible for assigning a class to each object and generating a bounding box around it (classification and regression). One-stage detectors like YOLO apply classification/regression to each anchor box.

YOLO v4 architecture
In this section, we discuss YOLO v4. Fig 1 shows a diagrammatic representation of YOLO v4 architecture.
• Backbone: YOLO v4 employs CSPDarknet53 as a feature extractor on a graphics processing unit (GPU). Some backbones are more appropriate for classification than for detection; for example, CSPResNext50 is better than CSPDarknet53 for image classification, whereas CSPDarknet53 proves better for object detection. For better detection of small objects the backbone needs a larger network input size, and for larger receptive fields more layers are required.
• Neck: YOLO v4 aggregates backbone features using spatial pyramid pooling (SPP) and a path aggregation network (PANet).
• Head: YOLO v4 utilizes the same head as YOLO v3, with anchor-based detection steps.
YOLO v4 performance optimization. The authors of YOLO v4 distinguished two types of methods used to improve an object detector's accuracy, and examined both to obtain fast operating speed with high accuracy:
• Bag of Freebies (BoF): Methods that deliver better accuracy without increasing inference cost. One example is data augmentation: a model trained on a small dataset has poor generalization ability, which leads it toward overfitting. Overfitting usually arises when a deep neural network tries to learn the most frequently occurring patterns, and several methods have been proposed to resolve it.
• Bag of Specials (BoS): Plugin modules and post-processing methods that increase inference cost only slightly while significantly improving accuracy.
Bounding-box regression has traditionally been trained with a mean squared error (MSE) loss on the box coordinates, as in Eq (1).
MSE treats the box variables as independent rather than unified. To overcome this, the IoU [49] loss was proposed, which takes into account the areas of the ground-truth and predicted bounding boxes (BBox). This notion is further enhanced by the GIoU [50] loss, which adds the orientation and shape of the object to the area. Besides GIoU, the CIoU loss was introduced, which takes into account the overlap area, the aspect ratio, and the distance between center points. YOLO v4 uses the CIoU loss for bounding boxes because of its good performance and faster convergence.
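To make the overlap-based losses concrete, the sketch below computes IoU and GIoU for axis-aligned boxes in (x1, y1, x2, y2) form. This is an illustrative assumption for exposition, not the authors' training code; GIoU here follows the standard formulation of subtracting the normalized empty area of the smallest enclosing box.

```python
def _inter_union(a, b):
    """Intersection and union areas of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter, union

def iou(a, b):
    """Intersection over Union; 0 when the boxes do not overlap."""
    inter, union = _inter_union(a, b)
    return inter / union if union > 0 else 0.0

def giou(a, b):
    """Generalized IoU: penalizes the empty area of the smallest enclosing box."""
    inter, union = _inter_union(a, b)
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    c = (cx2 - cx1) * (cy2 - cy1)  # enclosing-box area
    return inter / union - (c - union) / c
```

The corresponding losses are 1 − IoU and 1 − GIoU; CIoU further adds center-distance and aspect-ratio penalty terms on top of this.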

Training dataset
In this paper, to tune the object detection model for human detection under various low-light conditions, the recently released ExDARK dataset [54] is considered, which specifically targets low-light environments. In this dataset, 12 different classes of objects are labeled, out of which we fetched the data of our desired class for training. The dataset contains 7,363 images captured under ten different types of low-light conditions.

Testing dataset
A custom dataset is used for the evaluation of the proposed model. The dataset was collected from a market in Rawalpindi, Pakistan during the night in the days of COVID-19. Pakistan is one of the most urbanized countries in South Asia, with a 3% yearly urban population growth rate. The large population and congested streets make it a riskier place for the growth of COVID-19, and it is very difficult to maintain a safe distance in such narrow places; hence, the monitoring system needs high accuracy in detecting and locating people. Evaluating the proposed framework in such a highly populated area helps us better analyze the performance of the model. The test dataset is a collection of 346 RGB frames, collected with the motionless ToF camera of a Samsung Galaxy Note 10+ installed 4.5 feet above the ground, where a 0° regular camera-view calibration is adopted. Sample images of low-light conditions from the custom dataset are shown in

Monitoring social distancing with deep learning and a single motionless time of flight (ToF) camera
The emergence of deep learning has caught much attention; it has become a dominant technology, introducing a variety of techniques to solve different challenges including self-driving [55], fraud detection [56-58], robotics [59], language translation [60], medical diagnosis [61], and many more [62]. Most of these challenges revolve around object detection, classification, segmentation, recognition, and tracking. In this research article, a deep learning-based solution is proposed that uses an object detection model to automate the task of social distance monitoring at a fixed camera distance (C_d) under various low-light environments. To monitor social distance at C_d, a motionless ToF [63] camera is utilized along with the YOLO v4 algorithm to maintain the speed-accuracy tradeoff.
ToF cameras give real-time distance images, which simplifies human monitoring tasks. These cameras utilize light pulses: the camera's light source is switched on for a short interval, the resulting pulse illuminates the scene, and the light returns after striking an object. This reflected light encounters a delay proportional to the distance of the object. The camera lens collects the incoming light and forms an image on the sensor. The camera-to-object distance is calculated by Eq (3):

D = (1/2) · S_L · L_p · S_2 / (S_1 + S_2)    (3)

where S_L is the speed of light, L_p is the length of the pulse, S_1 is the charge gathered while light is emitted, and S_2 is the charge gathered when there is no light emission. The view V captured by the ToF camera is the three-tuple V = (F, T_D, C_p), where F is an RGB frame with a given height and width, T_D is the safe-distance threshold value, and C_p is the camera position in the real-world environment. In a given V, we want to find the number of people p_o = (p_1, p_2, p_3, ..., p_n) and their pairwise distances PD = (ED_{p1,p2}, ED_{p1,p3}, ..., ED_{p1,pn}, ED_{p2,p3}, ED_{p2,p4}, ..., ED_{p2,pn}, ..., ED_{pn-1,pn}), where ED ∈ ℝ+ and p_n is the total number of people detected in one frame. We also want the safety threshold T_D in order to flag safety-distance violations (PD < T_D | PD = T_D | PD > T_D). The center points of the detected bounding boxes are CP = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}.
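Eq (3) can be sketched in a few lines. The function below is an illustrative assumption, not the camera firmware; the parameters s1, s2, and pulse_len mirror S_1, S_2, and L_p from the equation.

```python
def tof_distance(s1, s2, pulse_len, speed_of_light=3.0e8):
    """Pulse-based ToF range estimate: D = 0.5 * S_L * L_p * S_2 / (S_1 + S_2).

    s1: charge gathered while the light pulse is emitted
    s2: charge gathered after emission stops (reflection delay)
    pulse_len: pulse duration in seconds
    Returns the camera-to-object distance in meters.
    """
    return 0.5 * speed_of_light * pulse_len * s2 / (s1 + s2)
```

For example, with a 20 ns pulse and equal charges S_1 = S_2, the reflection arrives halfway through the integration window, giving a range of 1.5 m.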

Specifying T D in F
The safety threshold value considered to control the spread of disease is 100 cm, as specified by WHO [1]. To initialize the monitoring process, we placed two temporary targets (T1, T2) in the real-world environment at C_d with an actual self-distance D_T1T2 of 100 cm and captured an image. The captured image is passed to T_m, and the Euclidean distance E_d between the center points CP_i of the detected bounding boxes is calculated by Eq (4). The calculated E_d gives the distance between T1 and T2 in F in pixels, which is equivalent to the real-world unit distance D_T1T2. This E_d is used as the threshold value to filter people newly appearing in V. The environmental arrangement of the ToF camera with the target objects T1 and T2 and the safety threshold distance D_T1T2 is shown in Fig 4.

E_d = sqrt((x_2 − x_1)^2 + (y_2 − y_1)^2)    (4)
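A minimal sketch of this calibration step follows; the two target bounding boxes use hypothetical pixel coordinates, assumed only for illustration.

```python
import math

def center(box):
    """Center point of a bounding box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def euclidean(p, q):
    """Eq (4): Euclidean distance between two center points, in pixels."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Calibration: targets T1 and T2 are placed 100 cm apart in the real world,
# then detected in the captured frame (coordinates below are hypothetical).
t1_box = (100, 200, 160, 380)
t2_box = (400, 200, 460, 380)
ed_threshold = euclidean(center(t1_box), center(t2_box))  # pixels ≙ 100 cm
```

The resulting `ed_threshold` is the pixel distance that corresponds to the 100 cm WHO safety distance at this fixed camera distance C_d.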

Pixels to real-world unit distance mapping
To convert T_D from a pixel distance to a unit distance (cm), we note that T_D is directly proportional to D_T1T2, as described in Eq (5):

D_T1T2 = k · T_D    (5)

Here k is the constant representing one pixel, equivalent to D_T1T2 / T_D units. We convert the distance between the center points of newly appearing people at C_d in V into units by Eq (6):

Du_i = k · PD    (6)

where Du_i is the measured distance in units, k is the constant storing the pixel-to-unit equivalence, and PD is the Euclidean distance between the CP_i of all detected persons in F. The workflow of the proposed model is shown in Fig 5.
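Eqs (5) and (6), together with the violation check PD < T_D, can be sketched as follows. The function and variable names are illustrative assumptions, not the paper's implementation.

```python
SAFE_CM = 100.0  # WHO safety distance threshold in centimeters

def pixels_to_cm(pd_pixels, td_pixels, d_t1t2_cm=SAFE_CM):
    """Map a pixel distance to centimeters.

    td_pixels: calibrated pixel distance equivalent to d_t1t2_cm (Eq 5).
    pd_pixels: Euclidean pixel distance between two detected people.
    """
    k = d_t1t2_cm / td_pixels   # cm per pixel (Eq 5 rearranged)
    return k * pd_pixels        # Du_i = k * PD (Eq 6)

def violates(pd_pixels, td_pixels):
    """True when a pair of people is closer than the safety distance."""
    return pixels_to_cm(pd_pixels, td_pixels) < SAFE_CM
```

For example, if calibration found that 300 pixels correspond to 100 cm, a measured pairwise distance of 150 pixels maps to 50 cm and is flagged as a violation.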

Experimental setup
In the ExDARK training experiment, the selection of hyperparameters is as follows: the training steps are 35,000 and 50,000 at the two network sizes 320 and 416; the batch size is 64 with 16 subdivisions; a polynomial-decay learning-rate schedule is adopted with an initial learning rate of 0.001; the warm-up steps are 1,000; momentum and weight decay are 0.949 and 0.0005, respectively. From the bag of freebies (BoF), the mosaic data augmentation technique is utilized; from the bag of specials (BoS), the Mish and leaky-ReLU [64] activation functions are used. The network sizes are 320 × 320 and 416 × 416 with 3 channels, and the IoU threshold for ground-truth assignment is initialized to 0.213. The IoU normalizer is 0.07, and the CIoU loss is used for bounding boxes. To discard the large number of candidate boxes and choose the best ones, greedy non-maximum suppression (NMS) is used. The experiments were run on a Tesla T4 GPU with 16 GB memory, CUDA v10.1, and cuDNN v7.6.5.
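For completeness, a minimal pure-Python sketch of the greedy NMS step mentioned above (not the darknet implementation): the highest-scoring box is kept and any remaining box overlapping it beyond a threshold is suppressed.

```python
def _iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, iou_thresh=0.45):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # drop every remaining box that overlaps the kept one too much
        order = [i for i in order if _iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep
```

With two nearly identical detections of one person and one distant detection, only the stronger duplicate and the distant box survive.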

Evaluation standards
Common evaluation indicators for object detectors are precision, recall, and AP. The following explains these indicators in the context of person detection under various low-light conditions. Precision shows how accurately the model has predicted people, and recall is the number of truly detected people over the sum of truly detected and undetected people in the image, as given in Eqs (8) and (9):

Precision = TP / (TP + FP)    (8)
Recall = TP / (TP + FN)    (9)

AP is the mean of the precision scores taken after each true object is detected, as shown in Eq (7); it summarizes the performance of an object detection algorithm. Because of its comprehensive assessment ability, AP is used as the assessment indicator in this research, equivalent to mAP in the COCO detection metrics [42]:

AP = (1/n) Σ_{k=1}^{n} Precision(k)    (7)

The F1-score is calculated from the resulting precision and recall values, as described in Eq (10):

F1 = 2 · Precision · Recall / (Precision + Recall)    (10)

Summarizing the evaluation results based on mAP, the model exhibits overall good performance; network size 416 with IoU threshold 0.5 has the highest mAP value of 97.84%. The precision-recall curve (PR-curve) of the COCO evaluation at IoU thresholds ranging from 0.5 to 0.95 at the two network sizes is shown in Fig 6.
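The per-frame indicators of Eqs (8)-(10) reduce to a few lines; the counts used below are illustrative, not the paper's measured values, and this is a sketch rather than the COCO evaluation code.

```python
def precision(tp, fp):
    """Eq (8): fraction of detections that are real people."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Eq (9): fraction of real people that were detected."""
    return tp / (tp + fn) if tp + fn else 0.0

def f1_score(p, r):
    """Eq (10): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0
```

For example, a frame with 9 true positives, 1 false positive, and 3 false negatives yields precision 0.9, recall 0.75, and F1 ≈ 0.818.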

Detection results
We tested the trained model on the custom dataset. Detection results per frame extracted from the video are shown in Fig 7, and Table 2 shows the TP, FP, FN, precision, and recall values for the detected objects per frame. The model exhibits overall good performance in low-light environments: from Table 2 it can be observed that no false positive was detected in any frame, and the number of false negatives is also low. The PR-curve from the precision-recall values of Table 2 is shown in Fig 8; we note that the precision values remain constant from Frame 1 to Frame 15.

Experimental results
To evaluate the performance of our social distance monitoring solution, we performed a few tests at three fixed camera distances: 400 cm, 500 cm, and 600 cm. Test frames were collected from the motionless ToF camera of a Samsung Galaxy Note 10+ placed 4.5 feet above the ground, where C_p is 0° (a regular camera view). At each fixed camera distance we tested two scenarios: one above the specified safety threshold (100 cm), at 140 cm, and one below it, at 52 cm. Qualitative results are shown in Fig 9, and Table 3 reports the quantitative results; the MAE between the actual distance (Ad) and the measured distance (Du) is computed by Eq (12):

MAE = (1/n) Σ_{i=1}^{n} |Ad_i − Du_i|    (12)

The Ad and Du plot is shown in Fig 10, where the blue line shows the actual known distance in cm and the red line shows the measured distance in cm.
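Eq (12) can be computed directly; the sample distance values below are illustrative assumptions, not the paper's measurements.

```python
def mean_absolute_error(actual, measured):
    """Eq (12): MAE between actual (Ad) and measured (Du) distances, in cm."""
    if len(actual) != len(measured):
        raise ValueError("actual and measured must have the same length")
    return sum(abs(a, ) if False else abs(a - m)
               for a, m in zip(actual, measured)) / len(actual)

# Hypothetical test readings: actual placements vs. model measurements (cm)
actual_cm = [140.0, 52.0, 140.0]
measured_cm = [141.0, 51.0, 139.0]
mae = mean_absolute_error(actual_cm, measured_cm)
```

An off-by-one-centimeter error on each of three readings gives an MAE of 1.0 cm, matching the order of magnitude reported for the proposed approach.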

Limitations and discussion
This application is meant to be used in a real-time environment, so precision and accuracy are essential. The proposed model shows efficient results in the evaluation of YOLO v4 under low-light conditions, where not a single FP is detected; this matters because the accuracy and reliability of the model depend heavily on FPs. To evaluate the performance of the social distance monitoring strategy, a few tests were performed, as shown in Table 3. The proposed deep learning and motionless-ToF-camera-based social distance monitoring technique at C_d shows a good speed-accuracy tradeoff for monitoring social distancing during the night. The technique is limited in a few ways: social distance among people can only be monitored at fixed C_d values, and to initialize the monitoring process two temporary target objects must be placed in the environment.

Conclusion
This article proposes an efficient solution for real-time social distance monitoring in low-light environments. For real-time person detection, the YOLO v4 algorithm is trained on the ExDARK dataset. For monitoring social distance, a motionless ToF camera is used to observe people at a fixed camera distance and report the resulting distances in real-world units, with safety-distance violations highlighted. The proposed YOLO v4-based real-time social distance monitoring solution is evaluated with the COCO detection metrics. Experimental analysis shows that the YOLO v4 algorithm achieves good results in different low-light environments, with a 97.84% mAP score, and the observed MAE during the test of our social distance monitoring approach is 1.01 cm. The FPS score can be further enhanced by fine-tuning the same approach on Volta-class GPUs such as the Tesla V100 or Titan V. Because of its high precision and low error rate, the proposed technique can easily be applied in real-world scenarios, e.g., in banks to help cashiers monitor the people standing in front of them, in shops to help shopkeepers observe customers, and in train stations to help ticket clerks keep track of people violating the safe distance. In the future, we will extend the system to monitor social distance at varying camera distances and varying camera angles.