
An anomaly detection scheme for data stream in cold chain logistics

  • Zhibo Xie ,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Writing – original draft, Writing – review & editing

    xiezhibo@zwu.edu.cn

    Affiliation School of Information and Intelligent Engineering, Zhejiang Wanli University, Ningbo, China

  • Heng Long,

    Roles Software, Visualization

    Affiliation School of Information and Intelligent Engineering, Zhejiang Wanli University, Ningbo, China

  • Chengyi Ling,

    Roles Software

    Affiliation School of Information and Intelligent Engineering, Zhejiang Wanli University, Ningbo, China

  • Yingjun Zhou,

    Roles Resources, Validation

    Affiliation Ningbo Municipal Bureau of Ecology and Environment, Ningbo, China

  • Yan Luo

    Roles Data curation, Validation

    Affiliation Ningbo Municipal Bureau of Ecology and Environment, Ningbo, China

Abstract

Anomaly detection is widely used in cold chain logistics (CCL). However, because of high cost and technical difficulties, anomaly detection performance is often poor and anomalies cannot be detected in time, which affects the quality of goods. To solve these problems, this paper presents a new anomaly detection scheme for CCL. First, the characteristics of the data collected in CCL are analyzed, a mathematical model of the data stream is established, and the sliding window and correlation coefficient are defined. Then the abnormal events in CCL are summarized, and three types of abnormality judgment conditions based on the correlation coefficient ρjk are deduced. A measurement anomaly detection algorithm based on an improved isolation forest algorithm is proposed: subsampling and a cross factor are designed to overcome the shortcomings of the isolation forest algorithm (iForest). Experiments show that as the dimensionality of the data increases, the performance indicators of the new scheme, such as P (precision), R (recall), F1 score, and AUC (area under the curve), become increasingly superior to those of commonly used support vector machines (SVM), the local outlier factor (LOF), and iForest. Its average P is 0.8784, average R is 0.8731, average F1 score is 0.8639, and average AUC is 0.9064. However, the execution time of the improved algorithm is slightly longer than that of iForest.

Introduction

CCL is a supply logistics chain that uses refrigeration technology to maintain a suitable environment for perishable products such as fruits, vegetables, dairy, meat, fish, and medicine [1]. With the improvement of living standards, people have higher requirements for food quality, and CCL is more widely used in food transportation. As the main way to ensure food quality, CCL anomaly detection has become more important. Unlike other anomaly detection tasks, CCL anomaly detection faces many challenges: the data stream is collected in an environment with strong noise and interference; measurement error can be large because the sensors operate at low temperature; accuracy decreases with long-term sensor use; and real-time requirements are high, while the amount of data generated during monitoring is huge and the data dimension is high. Therefore, although many anomaly detection algorithms exist, few are effective for CCL anomaly detection, and the performance of the currently used algorithms is poor. It is thus necessary to develop efficient algorithms for CCL.

A safety warning system was designed for a warehouse with a wireless communication network and multiple sensors to monitor surrounding conditions such as fire and burglary [2]. That paper considers only the instantaneous sensor value: as soon as it exceeds a threshold, an alarm is given. Mariusz developed the theoretical basis for a rack technical-parameter monitoring program to ensure structural reliability and avoid potential collisions in high-bay or high-storage warehouses [3]. The paper mainly focuses on pressure and collision, and does not study other CCL parameters. A monitoring system based on multi-dimensional sensing was proposed in [4], but the sampled data must be read through a dedicated reader, so the hardware cost of the system is too high. Aiming at the quality problem in the CCL of dairy products, a new pre-warning method for dairy CCL based on support vector machines was proposed [5], but the paper does not consider that the sampled data has the characteristics of a data stream. Wang mentioned that smart tags can be used to detect food quality changes in CCL, but did not give a detailed hardware and software implementation [6].

In [7], a fresh-food sensory perception system was designed for CCL; its cost is too high because of the specialized website and remote server required. Wang S.X. researched an abnormal-temperature detection method for tilapia cold-chain logistics, introducing a fuzzy ARMM structure for abnormality detection; the experimental results show that the proposed method can detect abnormal tilapia temperatures in CCL with high overall accuracy [8]. Feng developed a real-time monitoring system for fruit and vegetable CCL with ZigBee technology [9]. Liu studied phase-change cold storage materials in CCL [10]. Witjaksono designed a temperature warning system that can be triggered to send signals to the refrigeration system, adjusting the refrigeration ventilation to keep the temperature in a suitable range [11]. A network-based cold chain database platform was developed for collecting and managing real-time temperature data to optimize and improve weak links in the supply chain [12]. Azzi et al. utilized blockchain technology to achieve distributed storage and management of supply chain data, ensuring data integrity, accuracy, and security [13]. An intelligent route-planning system based on IoT technology was designed by Tsang et al.; the system is configured with a WSN to monitor the entire chain in real time and keep the temperature stable during transportation [14]. Han predicted the forced-air cooling efficiency of fresh apples by combining an optimal differential evolution algorithm with a backpropagation neural network [15]. Neural network models require a large amount of data for pre-training, and the amount of data affects the efficiency of monitoring and analysis. Thermal imaging has been proposed to obtain two-dimensional images with multi-point temperature sensing, although its accuracy is affected by the reflectivity of the monitored object [16–18].
Statistical process control and sensor networks have been combined to detect and control cold chain temperature [19].

Hoang suggests that sustainable environmental control of refrigeration can be achieved by predicting future temperature changes [20]. Combining real-time IoT environmental monitoring with early-risk-warning decision support systems can reduce food losses [14,21]. R. Jedermann found that local offsets from the average value are much larger than expected when sensor networks are applied in containers [22].

Anomaly detection algorithms are generally divided into three kinds: unsupervised, semi-supervised, and supervised detection [23]. If data labels can be obtained, supervised anomaly detection is preferred; KNN (K-Nearest Neighbors) and SVM (Support Vector Machine) are typical supervised detection algorithms. When only a few data labels are available, a semi-supervised anomaly detection model can be used. In practice, however, anomaly detection data are often unlabeled, and the training data does not indicate which points are abnormal, so unsupervised detection should be used. Principal Component Analysis (PCA), one-class SVM, Angle-Based Outlier Detection (ABOD), LOF (Local Outlier Factor), and isolation forest are the main unsupervised detection models [24–26]. Jie Tang proposed a supervised machine learning model with SVM to control CCL [27]. Wei Wu researched a platform with generative adversarial networks (GAN) and a digital twin for CCL [28], which can be used for accident identification and for indoor localization based on Bluetooth Low Energy to actualize real-time staff safety supervision in the cold warehouse. X. M. designed a cost-effective over-temperature alarm system using an artificial neural network model [29]. An unsupervised deep neural structure, a stacked auto-encoder (SAE), was designed to identify abnormal stationary states from human motion status [30]. One-dimensional point monitoring is currently the main way to measure the temperature and humidity of the cold chain environment, and the results provide important parameters for evaluating whether the temperature and humidity meet the requirements of fresh produce and whether food quality and safety are maintained [31].
This method of judging food safety is one-sided, subjective, and unscientific due to problems such as the limited number of sensors and their imprecision, the uneven distribution and/or fluctuation of temperature and humidity in all stages of the cold chain, and the temperature and humidity gradient that exists between the food and the environment, especially for packed fruits. Abdella et al. [32] and Badia-Melis et al. [33] used ANNs for the data correction and time-series prediction of single-point sensors for food CCL. The future temperature trends and demand disturbances of the cold chain have been accurately determined by backpropagation and deep learning neural networks (long short-term memory [LSTM], stacked LSTM, bidirectional LSTM, convolutional LSTM), which have even replaced active RFID tags [20]. Forced air-cooling abnormalities for fresh apples have been predicted by combining an optimal differential evolution algorithm and a backpropagation neural network [15]. The main shortcomings of the anomaly detection algorithms for CCL in the above documents are: (i) The particularity of CCL data acquisition is not considered: it is usually necessary to arrange multiple sensors at different positions of a vehicle to make the measurement results more accurate, and the data collected by each sensor are, strictly speaking, data streams. Moreover, multiple sensors are often arranged in a CCL carriage; taking temperature sensors as an example, generally at least 5 temperature sensors are arranged at different positions to obtain a more accurate temperature. Therefore, CCL data form a high-dimensional real-time data stream. (ii) The anomaly detection algorithms mentioned above all focus on a single type of sensor, such as temperature, and do not cover all the sensors commonly used in CCL vehicles, such as temperature, humidity, oxygen concentration, carbon dioxide concentration, and pressure.
(iii) The multidimensional sensor data streams have strong temporal and spatial correlations, as well as strong noise.

In this paper, a novel CCL anomaly detection scheme is proposed, which not only considers the data-stream characteristics of the collected CCL data, but also comprehensively considers the anomaly detection of multiple types of data. First, the characteristics of the collected CCL data are analyzed, a mathematical model of the data stream is established, and the sliding window |W| and correlation coefficient are defined. Then the abnormal events in CCL are summarized, and three types of abnormality judgment conditions based on the correlation coefficient are deduced. Subsampling and a cross factor are designed to overcome the shortcomings of the isolation forest (iForest) algorithm when detecting outliers in samples that are too numerous or too high-dimensional. Experimental analysis shows positive results and demonstrates the effectiveness of the designed CCL anomaly detection algorithm compared with state-of-the-art algorithms. The three main contributions of this paper can be summarized as follows:

  1. The characteristics of CCL multi-dimensional sensor data streams are analyzed, and mathematical analysis models for the CCL data stream are deduced and established.
  2. Three kinds of abnormal conditions in CCL are analyzed in detail. Based on the correlation characteristics of the multi-dimensional data stream, a judgment scheme for each anomaly is given.
  3. An improved isolation forest algorithm is proposed for the most common abnormal situation. The crossover factor is used to build new iTrees and a new forest. Compared with common algorithms such as SVM, LOF, and iForest, experiments show that the improved algorithm achieves better P, R, F1 score, and AUC without increasing the computational complexity, at the cost of a slightly longer execution time.

The remainder of this paper is organized as follows. Section 2 details the proposed algorithm. Section 3 shows the experimental results. Section 4 gives the discussion. Finally, a summary is given in Section 5.

Materials and methods

The flowchart of the proposed algorithm, including training and testing, is shown in Fig 1. This section introduces the proposed scheme in detail.

Fig 1. Flowchart of abnormal detection for CCL.

Firstly, the characteristics of multi-dimensional sensor data are analyzed, the mathematical model of data stream is established, and the sliding window, correlation coefficient and variance are defined. Then, the main three kinds of abnormal events are analyzed, and the judgment conditions of abnormal events based on correlation coefficient are deduced. Finally, a measurement anomaly detection scheme based on the improved iForest algorithm is given, including the algorithm flow chart and the pseudocode of key functions.

https://doi.org/10.1371/journal.pone.0315322.g001

Analysis and modeling of multi-sensor data stream

In the CCL environment, sensor data form a typical data stream. A data stream, also known as stream data, is a data sequence that can only be read once in a predetermined order; the data arrive rapidly, continuously, and without bound. In addition, unlike general data streams, the sensor data in CCL have obvious temporal and spatial correlation. First, there is an obvious temporal correlation within the historical data collected by the same sensor. Second, to improve the reliability of the system, multiple sensors of the same type are usually used for measurements at multiple positions, so there is a certain spatial correlation between different sensors. Different types of sensors in the same space are also correlated; for example, the temperature and humidity in a space show a significant negative correlation. How to make full use of the temporal and spatial correlation between sensor data to improve the accuracy of anomaly detection is one of the issues that must be considered in sensor data anomaly detection.

Suppose that there are N sensors (such as temperature, humidity, CO2 concentration, etc.) configured in a mobile terminal to collect M different modes of data. Each node deployed in the wireless sensor network ensures the synchronization of data acquisition and information transmission through a time-synchronization mechanism. At a certain sampling time t, the data of the M different modes collected by any sensor node can be regarded as a data point U = (u1, u2,..., uM) in an M-dimensional space, and the data collected over a sampling period form a matrix:

    U = [ u1(t1)  u2(t1)  ⋯  uM(t1)
          u1(t2)  u2(t2)  ⋯  uM(t2)
            ⋮        ⋮            ⋮
          u1(tN)  u2(tN)  ⋯  uM(tN) ]        (1)

where t1, t2, …, tN are the sampling times. The sliding window model is used to process the data stream, as shown in Definition 1.

Definition 1: The sliding window model intercepts a window of length |W| from the sensor data stream and divides it into m small blocks, namely Block1, Block2,..., Blockm, each of length n. When the sample at the next sampling time tnext enters the sliding window, the sample at the last sampling time tlast is replaced.

(2)

Here, mod(a, b) is the remainder function. All sensors on the same sensor node internally use sliding windows to process the data stream, as shown in Fig 2.
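As a concrete illustration of Definition 1, the sliding window can be treated as a fixed-length ring buffer: a minimal Python sketch follows (the class and method names are ours, not from the paper), keeping |W| = m·n samples and evicting the oldest sample when a new one arrives.

```python
from collections import deque

class SlidingWindow:
    """Fixed-length window |W| over a sensor data stream (Definition 1).

    The window is conceptually split into m blocks of length n,
    so |W| = m * n. Loading a new sample evicts the oldest one.
    """

    def __init__(self, m: int, n: int):
        self.block_len = n
        self.window = deque(maxlen=m * n)  # |W| = m * n samples

    def load(self, value: float) -> None:
        # When the sample at t_next arrives, the sample at t_last
        # falls out of the window automatically (deque maxlen).
        self.window.append(value)

    def block(self, k: int) -> list:
        # Return Block_k (1-indexed) of the current window contents.
        start = (k - 1) * self.block_len
        return list(self.window)[start:start + self.block_len]
```

For example, with m = 2 blocks of length n = 3, only the six most recent samples are retained.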

Assuming that the data {uj(t1), uj(t2),..., uj(tp)} of the first p sampling times of sensor node j are loaded into the sliding window, the variance of this group of data is

    σj² = (1/p) · Σ_{i=1}^{p} ( uj(ti) − ūj )²        (3)

Here, ūj is the average value of the jth-dimension data collected by the corresponding sensor in the sliding window. When the new data uj(tp+1) is loaded into the sliding window, the window slides backward, the data in the window is updated to {uj(t2), uj(t3),..., uj(tp+1)}, and the corresponding sampling-data variance can be expressed as:

    σj²′ = (1/p) · Σ_{i=2}^{p+1} ( uj(ti) − ūj′ )²        (4)

The data at the subsequent sampling time can be deduced in turn.
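The windowed variance described by Eqs. (3) and (4) can be sketched in a few lines; the helper name `window_variance` is ours, and a population (1/p) normalization is assumed, as in the formulas above.

```python
def window_variance(samples):
    """Population variance of the samples currently in the sliding
    window, as in Eqs. (3)-(4): sigma^2 = (1/p) * sum((u - mean)^2)."""
    p = len(samples)
    mean = sum(samples) / p
    return sum((u - mean) ** 2 for u in samples) / p

# As u_j(t_{p+1}) arrives, the window drops u_j(t_1) and the
# variance is recomputed over the shifted contents.
window = [2.0, 4.0, 6.0]
v1 = window_variance(window)     # variance of {u(t1)..u(t3)}
window = window[1:] + [8.0]      # slide: window now holds {u(t2)..u(t4)}
v2 = window_variance(window)
```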

Definition 2: Correlation coefficient

    ρ = Σ_{j=1}^{p} (xji − x̄i)(yjk − ȳk) / sqrt( Σ_{j=1}^{p} (xji − x̄i)² · Σ_{j=1}^{p} (yjk − ȳk)² )        (5)

Here, xji and yjk represent the jth values in the time series of any two data streams Xi and Yk, respectively. The correlation coefficient ρ is an important indicator for evaluating the coherence of a multidimensional data stream. If ρ < 0, there is a negative correlation between the data streams; if ρ > 0, the data streams are positively correlated; if ρ = 0, the data streams have no correlation.
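Assuming Definition 2 denotes the standard Pearson correlation coefficient, it can be computed over two window-aligned streams as follows (the function name is ours):

```python
import math

def correlation(x, y):
    """Pearson correlation coefficient rho between two equal-length
    data-stream windows (Definition 2). rho < 0: negative correlation,
    rho > 0: positive correlation, rho = 0: no linear correlation."""
    p = len(x)
    mx, my = sum(x) / p, sum(y) / p
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Temperature and humidity in the same space are often negatively
# correlated; a rising-temperature window against a falling-humidity
# window yields rho close to -1 (up to floating point).
temp = [4.0, 4.5, 5.0, 5.5]
hum = [80.0, 78.0, 76.0, 74.0]
rho = correlation(temp, hum)
```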

In actual processing, if the sliding window model with width n is used to analyze the coherence of multiple data streams, the covariance matrix S11 of data stream Xi = (X1i, X2i,..., Xpi), the covariance matrix S22 of data stream Yi = (Y1i, Y2i,..., Ypi), and the covariance matrix S12 of data streams Xi and Yi need to be calculated, as shown in formula (6). After standardizing the data, the correlation coefficient of the sample corresponds to the covariance of the sample. Finally, the corresponding typical correlation coefficients and typical correlation variables can be obtained using the chi-square test method. For example, when a fire occurs in a CCL carriage, the temperature values collected by the sensor nodes have a significant positive correlation with the CO2 concentration values and a negative correlation with the humidity values. Therefore, research on the coherence and spatiotemporal correlation between multidimensional data in wireless sensor networks can provide a theoretical basis for accurate and efficient anomaly detection. On this basis, this article proposes a method for anomaly detection of WSN multimodal data streams.

    S = [ S11  S12
          S21  S22 ]        (6)

Classification and judgment of main abnormalities in CCL

The main causes of abnormal data generated by the CCL mobile terminal include: (1) a specific event occurs at the mobile terminal; for example, in case of water leakage in the carriage, the temperature reading of the sensor will decrease significantly; this is called an environment event. (2) When a node's circuit cannot work normally, all parameter readings of the node will be abnormal; this is called a node event. (3) The data collected by a node deviates from the normal data because of the influence of external factors; this is called a measurement event. Abnormal data derived from specific events often reflect that some significant change has occurred, which needs to be dealt with immediately, whereas abnormal data caused by a faulty sensor node indicates that the node needs maintenance. Because data from a measurement anomaly cannot represent the actual environmental characteristics, in order to make accurate judgments it is necessary to examine the data collected by the wireless sensor network so as to find abnormal data in time and analyze and identify its source.

The measured value of a sensor should accurately reproduce the actual environmental characteristics, so the measured value uj(ti) should fluctuate slowly within a certain range in a stable environment; when there is an abnormality, there will be a significant deviation in a short time. If uj(ti) meets equation (7), the measured value may be abnormal data.

(7)

Here, t is the sampling time, Eej(t) is the mathematical expectation of the measured value of a normally working sensor in the event area, and Enj(t) is the mathematical expectation of the measured value in the normal area. Enj(t) is generally considered constant under stable conditions. The values of Eej(t) and Enj(t) differ between environments and are determined by the data set.

When a sensor fails (its energy is exhausted, or it is damaged and cannot work normally), the same value may be generated continuously at different sampling times, that is,

    uj(ti) = uj(ti+1) = ⋯ = uj(ti+k)        (8)

The above two cases are the judgment conditions used to determine whether the sensor data are abnormal, and the abnormal probability Pj(ti) of the single-mode data stream is calculated based on them.

(9)

Here, the constant k represents the number of times {uj(ti)} satisfies the judgment condition, and c is a parameter. If uj(ti) continuously meets the judgment condition at several sampling times, k increases gradually from 0, and Pj(ti) grows exponentially with k. If uj(ti) does not meet the judgment condition, k, Pj(ti−1), and Pj(ti) are cleared at the same time; when uj(ti) meets the judgment condition again, k starts to accumulate anew.

A sensor node carries a variety of sensors, so at a given sampling time it produces multiple modes of data streams, generating multiple values of Pj(ti). It is not accurate to judge the cause of a data anomaly from a single-mode data stream alone; multi-mode data streams must be fused for analysis and judgment. The multi-mode anomaly probability PT(ti) can be calculated from the single-mode anomaly probabilities Pj(ti) as follows:

    PT(ti) = Σ_{j=1}^{M} λj · Pj(ti)        (10)

where λj is the weight coefficient. Since λj is related to the fluctuation range of the data, it can be set proportional to the standard deviation of each mode, that is,

    λj = σj / Σ_{k=1}^{M} σk        (11)
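The fusion described around Eqs. (10) and (11) — a weighted sum of single-mode probabilities with weights proportional to each mode's standard deviation — can be sketched as follows (the function name and example values are ours):

```python
def fuse_anomaly_probability(p_single, sigmas):
    """Multi-mode anomaly probability P_T(t_i) as the weighted sum of
    single-mode probabilities P_j(t_i) (Eq. 10), with weights
    lambda_j proportional to each mode's standard deviation (Eq. 11)."""
    total_sigma = sum(sigmas)
    lambdas = [s / total_sigma for s in sigmas]  # weights sum to 1
    return sum(l * p for l, p in zip(lambdas, p_single))

# Example: three modes (e.g., temperature, humidity, CO2) with
# per-mode anomaly probabilities and standard deviations.
pt = fuse_anomaly_probability([0.9, 0.1, 0.2], [2.0, 1.0, 1.0])
```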

When the PT(ti) of a sensor node reaches the threshold Rth, it is considered that the node may have an anomaly. Here, Rth is set as the weighted average of the means of the multimodal data set. Next, the spatial correlation of nodes is used to determine the type of anomaly. The sensor receives the PT(ti) of its neighbor nodes. According to the Pauta criterion, if the PT(ti) of the node satisfies |PT(ti) − μ| < δσ (where μ and σ are the mean and standard deviation of the PT(ti) of the neighbor nodes, respectively), the error is considered to come from random error in the event process, and the state of the node is consistent with that of its neighbor nodes; if not, the status of this node is considered inconsistent with that of the adjacent nodes, and there is a node fault event or a measurement event. The value of δ depends on the specific situation. Generally, the event can be regarded as a Bernoulli process with normally distributed random variables, so the random variable can be reduced to one with a standard normal distribution.

    P( |PT(ti) − μ| < δσ ) = 2Φ(δ) − 1        (12)

Here, Φ(δ) is the standard normal distribution function. From the standard normal table, Φ(δ) > 0.975 corresponds to p < 0.05, and Φ(δ) > 0.975 when δ exceeds 1.96, so we can take δ = 2.

Then, the three kinds of abnormal events can be judged as follows:

If and , it can be considered an environment event;

If and , it can be considered a node event;

If the above conditions are not met, it can be considered a measurement event, and the measured data must then be detected and filtered. In the actual CCL system, the measurement event is the most important of the three abnormalities.

Anomaly detection based on improved iForest algorithm

A node with a measurement error has already been cleared of the possibility of a node failure event or an environment abnormal event, and should therefore be a node that can work normally. However, its collected data stream contains values that differ significantly from the actual environmental characteristics, so it is necessary to detect the nodes with measurement errors and find the abnormal data, thereby improving the reliability of the CCL system.

The isolation forest algorithm is widely used in anomaly detection because it computes quickly (unlike clustering-based algorithms such as K-means, which spend a lot of time computing distances) and is robust. The steps of the algorithm are as follows: (i) A part of the samples is randomly selected from all the data as the set for an isolation tree; one dimension and one split point are randomly selected, dividing the data into two subspaces in that dimension, i.e., values less than the split point and values greater than or equal to it. (ii) Dimensions and split points continue to be selected randomly, and the operation is repeated until a subspace contains only one sample, all attribute values in the subspace are the same (it cannot be split further), or the preset tree height has been reached. When any one of these three conditions is met, the construction of the isolation tree stops. (iii) N such trees are built to form an isolation forest. (iv) A sample score is constructed: after the final forest is formed, the position of each sample in the trees is scored and the total score is obtained. The closer the score is to 1, the more likely the sample is abnormal; the closer it is to 0, the more likely it is normal. When the score is close to 0.5, it is impossible to judge whether the sample is abnormal or normal. The schematic diagram of the algorithm is shown in Fig 3.
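The four steps above can be sketched in Python; this is a minimal illustration of the standard iForest construction (not the paper's exact implementation), with all names ours:

```python
import random

def build_itree(data, height, max_height):
    """Recursively build an isolation tree (steps i-ii): pick a random
    dimension and a random split value, partition, and stop when the
    node holds one sample, identical samples, or max height is reached."""
    if height >= max_height or len(data) <= 1 or all(r == data[0] for r in data):
        return {"size": len(data)}  # external (leaf) node
    dim = random.randrange(len(data[0]))
    lo = min(r[dim] for r in data)
    hi = max(r[dim] for r in data)
    if lo == hi:
        return {"size": len(data)}  # cannot split on this dimension
    split = random.uniform(lo, hi)
    left = [r for r in data if r[dim] < split]
    right = [r for r in data if r[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_itree(left, height + 1, max_height),
            "right": build_itree(right, height + 1, max_height)}

def build_forest(data, n_trees=100, subsample=256, max_height=8):
    """Step iii: N trees, each over a random subsample, form the forest."""
    forest = []
    for _ in range(n_trees):
        sample = random.sample(data, min(subsample, len(data)))
        forest.append(build_itree(sample, 0, max_height))
    return forest
```

Scoring (step iv) then traverses each tree to the leaf containing a sample and averages the path lengths.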

However, the disadvantage of this algorithm is also obvious: it easily falls into a local optimum because it randomly selects an arbitrary single dimension for cutting. To overcome this shortcoming, this paper proposes an improved isolation forest algorithm based on full feature fusion.

In the process of building the traditional isolation forest model, each cut of the data space randomly selects one feature from all features, so some feature information may remain unused after all trees are built, resulting in low accuracy. To solve this problem, the feature set of the data set is cross-grouped while building the isolation forest model. Define Q as the full feature set of the data set, with n initially empty feature subsets Qi (i = 1, 2,..., n) of Q. The number of elements in the feature set Q is m, and the number of elements in each feature subset Qi is l. The cross factor r is defined as the proportion of the number of features contained in each feature subset to the number of all features in the data set, that is, r = l/m, with a value in (0, 1.0]. Each feature in the feature set is then put into the feature subsets in turn: the first feature in Q is put into Q1,..., Qm, the second feature into Q2,..., Qm+1, the i-th feature into Q(i−1) mod n + 1,..., Q(i+m−2) mod n + 1, and the (i+1)-th feature into Qi mod n + 1,..., Q(i+m−1) mod n + 1. A certain number of isolation trees are constructed on each feature subset after cross-grouping. The model construction process is as follows:

Step 1: Input the feature data set and initialize the parameters of the iForest for balanced modeling of full feature information. Here, set the number of iTrees to 100, the sub-sample size to 256, and the tree height to 8.

Step 2: The sub-sampling algorithm is used to sub-sample the data set, and the set composed of all features of the sub-sampled data set is divided according to the cross-grouping rules. Let the full feature set of the dataset be Q, with m elements. After cross-grouping, multiple feature subsets Q1, Q2, …, Qn are generated, each with l elements, and all feature subsets must satisfy Q1 ∪ Q2 ∪ ⋯ ∪ Qn = Q. The cross factor r = l/m is adjusted, with n fixed, according to the test effect on the validation set.

Step 3: Build iTrees using the above sample space. Each time the data space is cut for subtree division, the feature is selected from the feature-subset elements of Step 2. After the feature is determined, the value range of that feature is found, and a split value is randomly selected within the range to complete the subtree division of the data space. The process is repeated until a termination condition is reached, completing the construction of the isolation tree.
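The cross-grouping of Steps 1–3 can be illustrated as follows. Since the subset-index arithmetic in the text is hard to read in this copy, the sketch assumes the rule reduces to assigning each feature to a fixed number of consecutive subsets cyclically, so that each subset ends up with l = r·m features; the function name and this simplification are ours.

```python
def cross_group(m, n, r):
    """Cyclically assign m features to n overlapping subsets so each
    subset holds l = r * m features (cross factor r = l / m).

    Assumed reduction of the paper's cross-grouping rule: feature i
    joins `copies` consecutive subsets starting at subset i mod n.
    """
    l = round(r * m)
    copies = l * n // m          # how many subsets each feature joins
    subsets = [[] for _ in range(n)]
    for feat in range(m):
        for k in range(copies):
            subsets[(feat + k) % n].append(feat)
    return subsets
```

For m = 4 features, n = 4 subsets, and r = 0.5, every subset receives l = 2 features, and the union of all subsets covers the full feature set, as Step 2 requires.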

The pseudocode of the iTree is summarized in Table 1.

An isolation forest is built from isolation trees. To ensure differences between trees, each isolation tree is constructed from a randomly sampled part of the dataset. A certain number of isolation trees are built for each feature subset during forest construction. Finally, the multiple sets of isolation trees generated from the different feature subsets are integrated to form the isolation forest. The pseudocode of the isolation forest is summarized in Table 2.


The final judgment is determined by the abnormal score. The abnormal score is defined by expression (13).

    s(x, n) = 2^( −E(h(x)) / c(n) )        (13)

The value of s(x, n) lies in [0, 1.0]. The closer s(x, n) is to 1.0, the more isolated the sample is and the more likely it is to be abnormal. The average-path evaluation method of a sorted binary tree is used to normalize the results. c(n) is defined as follows:

    c(n) = 2H(n − 1) − 2(n − 1)/n        (14)

Here, H(i) = ln(i) + ξ, where ξ is the Euler–Mascheroni constant (approximately 0.5772156649). h(x) is the path length of the test data traversing a single isolation tree, and E(h(x)) is the average path length of the test data across all isolation trees. The pseudocode of the path-length calculation is summarized in Table 3.
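Eqs. (13) and (14) can be computed directly; the sketch below uses the standard iForest normalization with H(i) ≈ ln(i) + ξ (function names are ours):

```python
import math

def c(n):
    """Average path length of an unsuccessful BST search (Eq. 14):
    c(n) = 2*H(n-1) - 2*(n-1)/n, with H(i) ~= ln(i) + xi."""
    if n <= 1:
        return 0.0
    xi = 0.5772156649  # Euler-Mascheroni constant
    harmonic = math.log(n - 1) + xi
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path, n):
    """Anomaly score of Eq. (13): s(x, n) = 2^(-E(h(x)) / c(n)).
    Scores near 1 indicate isolation (likely anomaly); near 0, normal."""
    return 2.0 ** (-avg_path / c(n))
```

A sample isolated after a short average path scores higher (more anomalous) than one requiring a long path.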

In the construction of each isolation tree, features are selected from all elements of the corresponding subset, which improves the balance of feature-information usage during tree construction and realizes full use of the data's feature information. This method not only overcomes the shortcomings caused by high dimensionality but also does not increase the workload of the algorithm; only the value of the cross factor needs to be tuned. It is a fast and efficient improved isolation forest algorithm.

Experiment and results

This section describes the experiments. First, a brief description of the experimental data and environment is given. Then four experimental results are presented: the selection of the optimal sliding-window length and the optimal cross factor; the performance comparison between the improved iForest and iForest in P, R, F1 score, and AUC; and the performance comparison among the improved iForest, iForest, SVM, and LOF in AUC and running time.

Data and environment

The data collection equipment was placed on three CCL vehicles of a CCL company in Ningbo, Zhejiang Province, China, to collect real-time sensor data from the moving vehicles; the experiments ran from December 1 to December 7, 2023. There are 30 nodes in each compartment, and each node has 5 sensors that detect the temperature, humidity, CO2 concentration, O2 concentration, and pressure near the node, as shown in Table 4. Therefore, there are 150-dimensional data streams in each carriage. Each sensor node sends its detection parameters to the sink node every minute, so the experimental dataset is a 150-dimensional real-time data stream whose characteristic parameters include temperature, humidity, oxygen concentration, carbon dioxide concentration, and pressure values. The data for each feature parameter is 40 bits, so the dataset in each carriage is 30 Mbits. The experiments were carried out on a Lenovo desktop computer running Windows 11, configured with an Intel Core i7 CPU and 16 GB of memory.

Table 4. Vehicle information and parameters in the experiment.

https://doi.org/10.1371/journal.pone.0315322.t004

Performance indicators

Precision P, recall R, F1 score, AUC and algorithm execution time are selected as the performance indicators. The confusion matrix is shown in Table 5; its purpose is to compare actual and predicted labels. TP (true positive) and TN (true negative) denote correctly predicted records, and FP (false positive) and FN (false negative) denote misclassified ones: TPs and TNs are correctly classified abnormal and normal records, respectively, while FPs and FNs are misclassified normal and abnormal records, respectively.

Table 5. Confusion matrix for calculating the abnormal detection.

https://doi.org/10.1371/journal.pone.0315322.t005

Based on the confusion matrix, P represents the proportion of samples predicted to be positive that are truly positive. The higher the precision, the better the prediction performance. P is defined as follows:

P = TP / (TP + FP)        (15)

R refers to the ratio of correctly predicted positive samples to the total number of positive samples. The higher R is, the more positive samples are predicted correctly and the better the prediction performance. R is defined as follows:

R = TP / (TP + FN)        (16)

F1 score is the harmonic mean of P and R, that is, a statistical measure of a system's accuracy computed from its P and R, given as:

F1 = 2 × P × R / (P + R)        (17)
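The three metrics above follow directly from the confusion counts; a minimal sketch (function names are mine, not from the paper):

```python
def precision(tp, fp):
    """Eq (15): fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Eq (16): fraction of actual positives that are predicted positive."""
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    """Eq (17): harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)
```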

The ROC (receiver operating characteristic) curve originated in signal processing theory and was later extended to other domains such as data mining, machine learning and artificial intelligence. A ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) over a series of binary classification thresholds, and the area under the curve is defined as the AUC. ROC curves and AUC are commonly used to evaluate classification models; the larger the AUC, the better the prediction performance.
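AUC can also be computed without plotting the curve, using its rank interpretation: the probability that a randomly chosen anomalous record scores higher than a randomly chosen normal one. A self-contained sketch under that interpretation (function name is illustrative):

```python
def auc_score(labels, scores):
    """AUC as the probability that a randomly chosen anomalous record
    (label 1) receives a higher anomaly score than a randomly chosen
    normal record (label 0); ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```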

The running time of the algorithm is another important performance indicator; it includes the model training time and the model verification time.

Results and analysis

First, we need to determine the optimal length of the sliding window. A larger window yields higher accuracy but also a larger amount of calculation. To investigate the impact of sliding windows of different lengths on the statistical characteristics of the data stream, 10,000 one-dimensional data groups of temperature, CO2 concentration, O2 concentration, humidity and pressure were selected, and the variance was calculated with sliding windows of different lengths. The results are shown in Table 6: the variance of the data stream stabilizes as the window length increases, and once the window length exceeds 200 the variance is essentially stable. Therefore, the optimal sliding window length here is 200.
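The window-length selection can be sketched as follows; the stream here is synthetic stand-in data (the paper's sensor recordings are not included), and the function name is illustrative:

```python
from statistics import pvariance
import random

def mean_window_variance(stream, length):
    """Average variance over consecutive, non-overlapping windows
    of the given length, as a stability indicator for that length."""
    windows = [stream[i:i + length]
               for i in range(0, len(stream) - length + 1, length)]
    return sum(pvariance(w) for w in windows) / len(windows)

# hypothetical stand-in for one 10,000-sample temperature stream
rng = random.Random(42)
stream = [20.0 + rng.gauss(0.0, 0.5) for _ in range(10000)]
table = {L: mean_window_variance(stream, L) for L in (50, 100, 200, 400)}
```

In practice one would pick the smallest length beyond which the tabulated variance stops changing appreciably, as done with Table 6.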

Table 6. Variance of data stream of different sliding window length.

https://doi.org/10.1371/journal.pone.0315322.t006

Then we study the optimal value of the cross factor r. Following reference [1], the number of iTrees is set to 100, the subsample size to 256, and the number of cross groups to 10; the cross factor r varies from 0.1 to 1.0 in steps of 0.1. The relationship between P, R, F1 score, AUC and the cross factor r is shown in Fig 4. The improved iForest algorithm outperforms the original algorithm in P, R, F1 score and AUC when r is 0.5 to 0.9, with the best performance at r = 0.6. In this experiment, the cross factor determines the number of features in each isolated tree's feature subset. When r is below 0.4, the grouped feature subsets contain too few elements, the isolated trees carry insufficient information, and the detection effect is poor. When r is 0.5 to 0.9, the isolated trees built on the grouped feature subsets use feature information in a balanced way, and the detection effect improves. When r is 1.0, every isolated tree is constructed from all features. Since the line graph shows the best detection effect at r = 0.6, the cross factor is set to 0.6 in the subsequent experiments.
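The sweep over r amounts to a simple grid search; in the sketch below `evaluate` is a purely hypothetical stand-in for the paper's scoring pipeline (e.g., AUC on validation data), not the authors' code:

```python
def best_cross_factor(evaluate, step=0.1):
    """Grid-search the cross factor r over (0, 1] with the given step.

    `evaluate` is any callable mapping a candidate r to a validation
    score such as AUC; the r with the highest score is returned.
    """
    candidates = [round(step * k, 10) for k in range(1, round(1 / step) + 1)]
    return max(candidates, key=evaluate)
```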

Fig 4. The performance indicators P, R, F1 score, AUC under different cross factor r.

https://doi.org/10.1371/journal.pone.0315322.g004

Third, we compare the proposed algorithm with the main algorithms currently in use. Besides iForest, SVM and LOF are among the most commonly used anomaly detection algorithms. We tested the four algorithms on three vehicles (A, D and F) of a local CCL company. Each van contains 30 nodes; we selected 5, 10, 15, 20, 25 and 30 nodes for the experiments to verify the impact of data streams of different dimensionality on the results. Table 7 shows the P, R, F1 score and AUC for each vehicle when 5 nodes collect the sensor data streams. Clearly, the proposed algorithm has no particularly obvious advantage at this dimensionality.

Table 7. The comparison of performance indicators of four algorithms with 5 nodes.

https://doi.org/10.1371/journal.pone.0315322.t007

Table 8 shows the P, R, F1 score and AUC for each vehicle with 10 nodes. The iForest algorithm and the improved iForest algorithm gain an advantage in the performance indicators as the dimensionality of the data stream increases.

Table 8. The comparison of performance indicators of four algorithms with 10 nodes.

https://doi.org/10.1371/journal.pone.0315322.t008

Table 9 shows the P, R, F1 score and AUC for each vehicle with 15 nodes collecting the sensor data streams. The improved iForest algorithm has a clear advantage in the performance indicators, and its indicators surpass those of the original iForest algorithm.

Table 9. The comparison of performance indicators of four algorithms with 15 nodes.

https://doi.org/10.1371/journal.pone.0315322.t009

Table 10 shows the P, R, F1 score and AUC for each vehicle with 20 nodes. The proposed algorithm clearly leads all the compared algorithms in the performance indicators.

Table 10. The comparison of performance indicators of four algorithms with 20 nodes.

https://doi.org/10.1371/journal.pone.0315322.t010

Tables 11 and 12 show the P, R, F1 score and AUC for each vehicle with 25 and 30 nodes collecting the sensor data streams. The performance indicators of the proposed algorithm are clearly better than those of the other algorithms: as the dimension increases, the computation of the SVM and LOF algorithms grows sharply, and the iForest algorithm is prone to falling into local optima.

Table 11. The comparison of performance indicators of four algorithms with 25 nodes.

https://doi.org/10.1371/journal.pone.0315322.t011

Table 12. The comparison of performance indicators of four algorithms with 30 nodes.

https://doi.org/10.1371/journal.pone.0315322.t012

The average performance indicators for different node counts under the four algorithms are shown in Fig 5. When the dimensionality is low, the advantages of the proposed algorithm are not significant; as the dimensionality increases, its performance indicators become significantly higher than those of the other three algorithms.

Fig 5. Comparison of performance indicators with different node number.

https://doi.org/10.1371/journal.pone.0315322.g005

When the dimensionality of the data stream is low, such as with 5 nodes, the iForest algorithm and the proposed algorithm have no significant advantage, and some of their performance indicators fall below those of the LOF and SVM algorithms. When the dimensionality increases, such as with more than 10 nodes, both have significant advantages in the performance metrics. The performance of iForest and the proposed algorithm is very similar at low dimensions; at high dimensions the proposed scheme is better, because it overcomes iForest's tendency to become trapped in local optima. With 30 nodes, P is 0.8779, R is 0.8836, the F1 score is 0.8630 and the AUC is 0.9075, all better than those of the other three algorithms.

The ROC curves of the four algorithms with 30 and 5 nodes are compared in Fig 6 and Fig 7. Consistent with the four performance indicators above, when the data dimensionality is low the ROC of the proposed algorithm is not significantly better than that of the other three algorithms, while when the dimensionality is high its ROC is significantly better.

Fig 6. Comparison of ROC curves for four algorithms with 30 nodes.

https://doi.org/10.1371/journal.pone.0315322.g006

Fig 7. Comparison of ROC curves for four algorithms with 5 nodes.

https://doi.org/10.1371/journal.pone.0315322.g007

The execution time comparison of the four algorithms is shown in Table 13. The execution time of the improved algorithm is slightly longer than that of iForest, but significantly shorter than those of the SVM and LOF algorithms. This is because the computational complexity of the improved algorithm and iForest is O(n), while that of SVM and LOF is O(n²).

Table 13. The comparison of execution time of four algorithms.

https://doi.org/10.1371/journal.pone.0315322.t013

Discussion

Based on the experiments above, three issues deserve discussion.

The first is the sliding window length. The optimal length of 200 for multidimensional data streams was determined experimentally from actual measurements, and the value will differ in other experimental environments. Although, in theory, a longer sliding window yields a smaller calculation error, a suitable value should be chosen in view of the implementation cost.

The second issue is the value of the cross factor r, which determines the number of features in the feature subset used to construct each isolated tree. When r is small, the grouped feature subsets contain too few elements, the constructed isolated trees carry insufficient information, and detection performance is poor. When r is larger, the isolated trees built on the grouped feature subsets use feature information in a balanced way, and the detection effect improves. The maximum value of r is 1.0, which means every isolated tree is constructed from all features. The experimental results show that feature cross-grouping improves the P, R, F1 score and AUC of iForest in CCL anomaly detection, with the best r being 0.6. The proposed algorithm fully utilizes the feature information of the data, balances feature usage during model construction, and achieves better detection performance.

The third issue is the impact of data dimensionality on algorithm performance. The proposed algorithm is more suitable for high-dimensional situations: the experiments show that the higher the dimension, the more pronounced its performance advantage, while in low-dimensional situations it has no obvious advantage. With 30 nodes, the P, R, F1 score and AUC of the proposed algorithm are 87.79%, 88.36%, 86.30% and 90.75%, respectively. In terms of execution time, the proposed algorithm and iForest are the fastest, indicating low computational complexity, fast running speed and high efficiency.

Conclusions

In this paper, through a detailed analysis of the characteristics of multidimensional data streams, we introduced the sliding window model and the correlation coefficient to establish a mathematical model for multidimensional data streams. Three main abnormal events, namely environment, node and measurement events, were derived from this model, and an improved isolation forest algorithm was then proposed for the measurement event, the most frequent abnormal event in CCL. The algorithm combines several data dimensions via the cross factor so that the trees jointly cover all feature vectors, which overcomes the disadvantage of being easily trapped in local optima and, thanks to the sliding window, does not add much computation. Experimental analysis shows positive results and demonstrates the system's effectiveness on three main types of current CCL vehicles. The proposed scheme can be used not only for anomaly detection in cold chain logistics but also in other multidimensional real-time data streams, especially where there is strong temporal and spatial correlation between the streams.

Although the proposed scheme and algorithm are robust for CCL, further improvements are still possible. For instance, the sliding window length was obtained statistically in this experiment, and this parameter affects both the accuracy and the execution time of the algorithm. In addition, for low-dimensional data streams or datasets, the proposed scheme has no significant advantage and its performance indicators improve little. Future work will therefore focus on further improving performance under all conditions.

References

1. Han J-W, Zuo M, Zhu W-Y, Zuo J-H, Lü E-L, Yang X-T. A comprehensive review of cold chain logistics for fresh agricultural products: Current status, challenges, and future trends. Trends in Food Science & Technology. 2021;109:536–51.
2. Hou L, Li GY. Research on warehouse safety warning system based on multi-sensor fusion. AMM. 2014;556–562:5640–3.
3. Mariusz K. Securing of safety by monitoring of technical parameters in warehouse racks, in high-bay warehouses and high storage warehouses – literature review of the problem. LogForum. 2017;13(2):125–34. https://doi.org/10.17270/J.LOG.2017.2.1
4. Zhao C, Ding H. Cold chain logistics surveillance system based on multi-dimensional information sensing. Journal of Zhengzhou University (Natural Science Edition). 2020;52(1):54–9.
5. Yang W. Research on dairy cold chain logistics pre-warning. China Dairy Industry. 2018;46(7):50–5.
6. Wang JQ. Effects of cold chain logistics on meat freshness and intelligent detection. Packing Engineering. 2022;43(1):148–57.
7. Cheng RQ, Chen S. Design of fresh food sensory perceptual system for cold chain logistics. Storage and Process. 2018;18(4):136–40.
8. Wang SX. Tilapia cold-chain logistics abnormal temperature detection method in the study. Science Technology and Engineering. 2017;17(19):177–81.
9. Feng H, Wu MM, Yang J. Real time monitoring system of fruit and vegetable cold chain logistics based on ZigBee technology. Jiangsu Agricultural Sciences. 2017;45(6):219–21.
10. Liu C. Research progress on PCCSM used in cold chain logistics. New Chemical Materials. 2021;49(2):16–9.
11. Witjaksono G, Saeed Rabih AA, Yahya N bt, Alva S. IOT for Agriculture: Food Quality and Safety. IOP Conf Ser: Mater Sci Eng. 2018;343:012023.
12. Gogou E, Katsaros G, Derens E, Alvarez G, Taoukis PS. Cold chain database development and application as a tool for the cold chain management and food quality evaluation. International Journal of Refrigeration. 2015;52:109–21.
13. Azzi R, Chamoun RK, Sokhn M. The power of a blockchain-based supply chain. Computers & Industrial Engineering. 2019;135:582–92.
14. Tsang YP, Choy. An Internet of Things (IoT)-based risk monitoring system for managing cold supply chain risks. IMDS. 2018;118(7):1432–62.
15. Han J-W, Li Q-X, Wu H-R, Zhu H-J, Song Y-L. Prediction of cooling efficiency of forced-air precooling systems based on optimized differential evolution and improved BP neural network. Applied Soft Computing. 2019;84:105733.
16. Hussain A, Pu H, Sun D-W. Innovative nondestructive imaging techniques for ripening and maturity of fruits – A review of recent applications. Trends in Food Science & Technology. 2018;72:144–52.
17. Mohd Ali M, Hashim N, Aziz SA, Lasekan O. Emerging non-destructive thermal imaging technique coupled with chemometrics on quality and safety inspection in food and agriculture. Trends in Food Science & Technology. 2020;105:176–85.
18. Pereira CG, Ramaswamy HS, Giarola TM de O, de Resende JV. Infrared thermography as a complementary tool for the evaluation of heat transfer in the freezing of fruit juice model solutions. International Journal of Thermal Sciences. 2017;120:386–99.
19. Xiao X, He Q, Fu Z, Xu M, Zhang X. Applying CS and WSN methods for improving efficiency of frozen and chilled aquatic products monitoring system in cold chain logistics. Food Control. 2016;60:656–66.
20. Hoang HM, Akerma M, Mellouli N, Montagner AL, Leducq D, Delahaye A. Development of deep learning artificial neural networks models to predict temperature and power demand variation for demand response application in cold storage. International Journal of Refrigeration. 2021;131:857–73.
21. Liu L, Liu X, Li W. Hierarchical network modeling with multidimensional information for aquatic safety management in the cold chain. Food Sci Nutr. 2018;6(4):843–59. pmid:29983947
22. Jedermann R, Geyer M, Praeger U, Lang W. Sea transport of bananas in containers – Parameter identification for a temperature model. Journal of Food Engineering. 2013;115(3):330–8.
23. Mobeen A, Joshi S, Fatima F, Bhargav A, Arif Y, Faruq M, et al. NF-κB signaling is the major inflammatory pathway for inducing insulin resistance. 3 Biotech. 2025;15(2):47. pmid:39845928
24. Pang G, Shen C, Cao L, Hengel AVD. Deep Learning for Anomaly Detection. ACM Comput Surv. 2021;54(2):1–38.
25. Xu X, Liu H, Yao M. Recent Progress of Anomaly Detection. Complexity. 2019;2019(1).
26. Jeong M, Lee N, Ko BS. Ensemble deep learning model using random forest for patient shock detection. KSII Transactions on Internet and Information Systems. 2023;17:1080–99.
27. Tang J, Zou Y, Xie R, Tu B, Liu G. Compact supervisory system for cold chain logistics. Food Control. 2021;126:108025.
28. Wu W, Shen L, Zhao Z, Harish AR, Zhong RY, Huang GQ. Internet of Everything and Digital Twin enabled Service Platform for Cold Chain Logistics. J Ind Inf Integr. 2023;33:100443. pmid:36820130
29. Meng X, Xie R, Liao J, Shen X, Yang S. A cost-effective over-temperature alarm system for cold chain delivery. Journal of Food Engineering. 2024;368:111914.
30. Zhan X, Wu W, Shen L, Liao W, Zhao Z, Xia J. Industrial internet of things and unsupervised deep learning enabled real-time occupational safety monitoring in cold storage warehouse. Safety Science. 2022;152:105766.
31. Tagliavini G, Defraeye T, Carmeliet J. Multiphysics modeling of convective cooling of non-spherical, multi-material fruit to unveil its quality evolution throughout the cold chain. Food and Bioproducts Processing. 2019;117:310–20.
32. Abdella A, Brecht JK, Uysal I. Statistical and temporal analysis of a novel multivariate time series data for food engineering. Journal of Food Engineering. 2021;298:110477.
33. Badia-Melis R, Mc Carthy U, Ruiz-Garcia L, Garcia-Hierro J, Robla Villalba JI. New trends in cold chain monitoring applications - A review. Food Control. 2018;86:170–82.