Spatial-Temporal Congestion Identification Based on Time Series Similarity Considering Missing Data

Traffic congestion varies spatially and temporally. The observation of the formation, propagation and dispersion of network traffic congestion can lead to insights about the network performance, the bottleneck dynamics etc. While many researchers use the traffic flow data to reconstruct the congestion profile, the data missing problem is bypassed. Current methods either omit the missing data or supplement the missing part by average etc. Great error may be introduced during these processes. Rather than simply discarding the missing data, this research regards the data missing event as a result of either the severe congestion which prevent the floating vehicle from entering the congested area, or a type of feature of the resulting traffic flow time series. Hence a new traffic flow operational index time series similarity measurement is expected to be established as a basis of identifying the dynamic network bottleneck. The method first measures the traffic flow operational similarity between pairs of neighboring links, and then the similarity results are used to cluster the spatial-temporal congestion. In order to get the similarity under missing data condition, the measurement is implemented in a two-stage manner: firstly the so called first order similarity is calculated given that the traffic flow variables are bounded both upside and downside; then the first order similarity is aggregated to generate the second order similarity as the output. We implement the method on part of the real-world road network; the results generated are not only consistent with empirical observation, but also provide useful insights.


Background
As the number of vehicles steadily increases and the expansion of roadways is relatively slow, traffic congestion is becoming a central transportation issue in big cities. The traffic congestion is a result of the imbalance between traffic supply and demand. The bottleneck is the weak point of the supply-demand structure. Thus generally congestion first begins from these bottlenecks and then propagates around the network. Sometimes the process is called cascading failures [1,2]. Such daily road network operation can be observed experimentally from traffic flow data.
Exploration of the network scale traffic flow data can provide insights to the understanding and management of road network. While static bottlenecks have gained enough attention and many research results [3], such dynamic network bottleneck are still attracting ongoing researches.
Exploring network traffic congestion from traffic flow data is not a new idea. Due to the deployment of ITS, traffic flow data becomes more and more common nowadays. Various methods are proposed to achieve the task above. For example, Anbaroglu, Heydecker, and Cheng [4] used link journey time to cluster the congestion. The link was considered as congestion when the travel time exceeds a threshold; Hu et al. [5] used the vehicle trajectory data to identify the congestion along a road. The velocity at a moment was indicated by the gray pixel in the image. Then the congestion was identified by image processing method; Bauza and Gozalvez [6] proposed the congestion detection method based on vehicle-to-vehicle technology; Wang Lu et al. [7] visualized the traffic jam in Beijing using GPS trajectory data. The jam identification was realized by applying a speed threshold. And the data was assumed complete; Damaiyanti, Imawan, and Kwon [8] using a similar way to indicate the congestion as a heat map style. Due to large size of the data, map reduce framework was implemented. Another type of method is based on kernel density estimation (KDE) method. Li et al. [9] used FCD (floating car data) to estimate the kernel density of accessibility of POI (points of interest). The accessible cell was confined by the travel time and the predefined bandwidth. Schrank, Lomax, and Crum [10] developed a method to identify the Worst Bottlenecks in Texas State. The method used traffic speed data, roadway geometry, and traffic counts to calculate congestionrelated performance measures. Performance measures such as annual delay per mile, congestion cost, and the Travel Time Index are produced from this analysis and were used to rank the congested segments across. John, Eisele, and Schrank [11] developed a method that can group links with similar traffic state using speed data. Pairs of links speed plot was established to get the similarity between adjacent links, in which similar traffic operation will result in linear relationship. Xu, Yue, and Li [12] proposed a multi-dimension analysis framework of traffic congestion based on historical floating car data set. The congestion event was identified by some spatial and temporal threshold; Wang, He, Stenneth et al. [13] proposed a traffic congestion estimation framework by using Twitter as the data source. Common twitter users' traffic information were selected by matching at least one term of the predefined congestion vocabularies. Another kind of method that needs to be noticed is data cube [14], where traffic flow data are organized into a cube shape, with spatial-temporal dimensions, therefore different levels of analysis can be applied.

Conclusion of Review
Although various methods including simulation results [15] are proposed to identify, estimate and descript the evolution of traffic congestion, still one problem remains unanswered: how to identify the bottleneck of the surface street network or urban road network. The congestion itself is the result of bottleneck dynamics. Reconstruction of bottleneck requires dynamic traffic flow data, which cannot be in great time scales such as hourly data, nor in small scales such as seconds, where great computation efficiency will be a big problem. Even under this condition, still many difficulties exist that can not be passed by. A typical example is the data missing problem. Traditionally the method either ignores the data, or supplements the missing part using average and similar data. It may create more errors. Moreover, the missing data may not just a decrease of information. For example, when there is no data in the hot-spot of CBD area, one possibility is that the congestion is so severe that no floating car can enter.
The purpose of this research is to answer the above question by a time series similarity measure approach. The idea is straightforward: the traffic data can be seen as many time series.
And for those links in the same congestion area, the trend should be similar, i.e. the velocity decrease or restore at the similar pace. Strangely such characteristic is neglected in many researches, where just the threshold is applied to define the congestion.
The whole process of bottlenecks identification includes mainly two components: • A time series similarity measure component, which is used to deal with the data missing problem; • A clustering algorithm component, which identifies the spatial-temporal congestion by a metaheuristic way. The clustering is implemented in three levels, from homogeneous traffic state to single step heterogeneous condition, then to multi-steps heterogeneous condition.
The following chart of the method is as in Fig 1:

Notations
The notations are listed in Table 1.

Time Series Similarity Considering Missing Data
The Model expressed as NaN, which means "Not a Number". Our objective is to calculate the similarity between θ 1 and θ 2 : When there is no missing data, a variety of methods can be applied, such as cosine similarity, Euclid distance similarity, etc. We define the similarity under this case as normal similarity: Actually there are many methods that can deal with missing data, for example we can just omit the missing part of a series (the corresponding data which is not missing in another series is omitted either); or supplement the missing data with for example the average across all possible values. Also we can use DTW (dynamic time wrapping) to deal with the two series with different length. However such method neglect the information behind the missing data. For example in the data set used where the data source is taxi GPS, the missing data may due to the severe congestion which impede the taxi to enter the link, or due to the low demand of the link/area. Simply drop the data or supplement it with mean may lose the underlying information and create great error.
Here we develop a method that can deal with missing data type time series similarity measurement. Since traffic parameters have its own limits such as velocity limit, we denote the up bound as t and down bound as t. It means that any value in the time series are alway greater than t and smaller than t.
In order to get the similarity measure between two time series, θ 1 and θ 2 , both with the length K, with different number of missing data, we carry out the following process:   1. Abstract the elements in the original time series in pairs, for example we can abstract (θ 1m , θ 1n ) and (θ 2m ,θ 2n ) for all m and n that satisfy m 6 ¼ n. Thus there are pairs. K is the length of the series.
2. With respect to each pair, (θ 1m ,θ 1n ) and (θ 2m ,θ 2n ), calculate the similarity measure s 0 mn = s 0 ((θ 1m ,θ 1n ),(θ 2m ,θ 2n )) of this pair based on the following five cases. Let p 1 = (θ 1m ,θ 1n ) and p 2 = (θ 2m ,θ 2n ). The similarities are all expressed as an interval: a. Case 1: All the data in the pair are not missing, then the similarity is: b. Case 2: Only one element in this pair is missing. Under this condition, there are mainly two cases: θ 1m = NaN or θ 1n = NaN. The similarity is calculated as follows: i. Case 2.1 θ 1m = NaN We construct the vector p 1 ¼ ðy 1 ; y 1n Þ and p 1 ¼ ðy 1 ; y 1n Þ. The up line and down line mean the up-bound and down-bound of the data. And since θ 1m = NaN, then the possible similarity between the pair is a interval, or a vector, not a scalar.
In order to get the interval, firstly suppose e = (1,0), which is considered as unit vector.
And then based on the simple geometry, the angle between vector p 1 and the x-axis is between the interval of two angles: And the angle between vector p 2 and the x-axis is: If ðA 2 À A 1 ÞðA 2 À A 1 Þ 0 it means that the vector A 2 is between the vector A 1 and A 1 , which makes the sign of A 2 À A 1 and A 2 À A 1 opposite, thus the maximum possible cosine similarity is 1 (the two series overlap) and the minimal similarity possibil- means that vector A 2 lie outside the vector A 1 and A 1 , and the minimal and maximal possible values of similarity are minfcosðA 2 À A 1 Þ; cosðA 2 À A 1 Þg and maxfcosðA 2 À A 1 Þ; cosðA 2 À A 1 Þg respectively. Therefore we get the cosine similarity interval in (7).: Surely we can also use other types of similarity measure instead of cosine similarity. For example we can examine the Euclid distance between the fixed point p 2 and the line between point p 1 and point p 2 , but the basic principle is the same.
The underlying idea of above equation is shown in the following Fig 2. ii. Case 2.2 θ 1n = NaN With respect to this type of missing data, it is similar to case 3.1. We just need to interchange the axis of 'x' and 'y' and construct another pair of p 1 and p 2 . The formulation of (7) is also applicable.
c. Case 3 There are two missing data in the pair. Under this case, there are mainly three types: i. Case 3.1 θ 1m = NaN and θ 1n = NaN Under this case the encompassing vectors that contain the vector p 1 are p 1 ¼ ½y; y and p 1 ¼ ½y; y. Thus we get the interval of its angles: And the angle of vector p 2 is the same in Eq (6). The similarity measure under this condition is the same in (7).
ii. Case 3.2 θ 1n = NaN and θ 2m = NaN Under this condition, both vectors lose one data point. Let p 1 ¼ ðy 1m ; y 1n Þ and p 1 ¼ ðy 1m y 1n Þ. Similary we have p 2 ¼ ðy 2m ; y 2n Þ and p 2 ¼ ðy 2m ; y 2n Þ. And we can get the interval between the angle formed by vectors and x-axis as ½A 1 ; A 1 and ½A 2 ; A 2 , where: If there is an overlap between ½A 1 ; A 1 and ½A 2 ; A 2 , i.e. min ðA 1 ; A 2 Þ ! max ðA 1 ; A 2 Þ, the maximum similarity would be one. The maximum angle between the two vectors is max ðA 1 ; A 2 Þ À min ðA 1 ; A 2 Þ and the minimum angle between the pair is: Therefore we get the similarity interval: iii. Case 3.3 θ 1n = NaN and θ 2m = NaN Under this condition, we can also construct the minimum and maximum angle between vectors. Thus Eq (12) can also apply, as long as the variables within the equation are replaced.
After the above process, we get C 2 K ¼ KðKÀ 1Þ 2 first order similarity measurements; each is a two variable vector thus can be described uniformly as S mn ¼ f½s a mn ; s b mn g. Two time series from them then are constructed: s a mn ¼ fs a 11 ; s a 12 ; s a 13 . . . . . .g ð16Þ There is no missing data in the interval series. Many similarity measures can be applied to the intervals series. We call this the second-order similarity measures: An Example of Missing Data Similarity We randomly choose two adjacent links to give a presentation of the similarity measurement results. Note that we can set a time domain where the similarity measure is taken. When the similarity measurement is obtained, we move the time domain forward as in Fig 3 in a rolling horizon manner. In this way the variation of time series can be observed. The step and time domain length can both be changed. There are also some features of gathering phenomena of the data, i.e. the data missing moments are relative close, mainly at peak hours, around about 06:00~07:00 and 16:00~17:00. The up bound and down bound of velocity is set based on the rank of the road. The major arterial road is set to 60km/h. Speed limit for minor arterial road and major branch are set to 50km/h and 40km/h respectively. Down bound of the velocity is set to zero.
The figures give three different time domain length, 1, 2 and 3 hours respectively. Rolling steps are all set to 0.5 hour. The velocity itself displays an irregular nature. From the trend of the velocity curve it is hard to decide at what moment is peak hour and at what moment is offpeak hour. But the missing data points appear mostly at peak hour based on the intuitive  observation. This may be due to the congestion which prevents the vehicles from entering the links. When there is missing data, the resulting similarity is smaller, since the possible interval of the velocity changes from down bound to up bound. This can be seen from the similarity when time domain length is 1 hour. The lower similarity corresponds to the peak hour, when the velocity of the road network is very low (according to the last but one section where velocity profile of the whole network is presented). Because the time domain length is 1 hour thus the similarity for the last hour (23:00~24:00) cannot be calculated since there is no enough data. When the time domain length increases to 2 hours and 3 hours, the similarity measurements can also be calculated. The whole trend still holds, but the similarity measurement is smoothed. The longer the time domain is, the smoother the similarity is.

Jam Clustering Method
A congestion event which covers many links will generate many similar time series. Vice versa, through the similarity between time series, we can reproduce the congestion profile based on traffic flow operational data. And during off-peak hours, random factors prevail more than peak hours. Thus similarity between peak hours is greater than that of off-peak hours. Given a similarity threshold, the area which shares similarity above the threshold can be sketched out. Combined with the operational data, the profile of a congestion can be derived.

General Model
Clustering is based on the similarity measures between neighboring links. Generally, the links within the same congestion area are similar in distribution thus leads to a greater similarity measure. Traditional clustering methods cluster all the objects into finite clusters. No one is left outside the clusters. And the core of the clustering algorithm is distance measurement, or similarity measurement between each pair of objects. However this traditional clustering framework is not suitable for our problem in the following aspects: • Not all of the objects, or links need to be clustered. Some links with very good traffic state don't belong to any congestion areas and due to their traffic performance no cluster requirement is needed. Mathematically clustering can be implemented on such links but the results make no sense; • During the traditional clustering process, the similarity measure of an object with a cluster is either by measure the object with a representative of the cluster or with the 'mean' of this cluster. Such definition is easy to understand. However it is hard to decide or define which one is representative of the congestion area or the mean of the congestion area.
However, some principles of traditional clustering algorithm can still be applied. Define J m as the subgraph that contains all links which don't belong to subgraph J m . For a jam cluster within a network, the definitions of a congestion cluster can be summarized as the following two conditions: In order to get the jam cluster J m , a similarity threshold s is set, the state above which is considered as congested.
Based on the above considerations, rather than using the traditional clustering framework, we introduce a metaheuristic algorithm for our congestion area identification. This algorithm first sets a similarity threshold, and then chooses the two links with the maximum similarity as a congestion core, and gradually attaches the neighboring links to the core to form a new congestion are and then repeat the attachment process until there are no neighboring links that satisfy the similarity threshold.
There are three levels of clusters according to the homogeneous of the traffic states.

Metaheuristic Algorithms
Single cluster. Suppose within a time domain, there is only one jam cluster in the road network. And the traffic states of all links within the jam are simultaneous, i.e. the traffic velocity decrease and restore back at the same trend. Thus the overall traffic state is uniform across the network.
Under such condition, the problem is typically a clustering problem and the clustered objects are all links. The similarity between objects is given by the models above. Suppose the jam cluster is denoted by J m where m is a specific time domain within which the traffic state can be considered as uniform. And J m can be expressed as J m = {θ i } where link i is considered as a congested link. Add all links that belongs to cluster J m form a weak-connected subgraph. Thus J m is also a subgraph of the original directed network.
The pseudo code for the single clustering problem can be formulated as in Fig 5: Multiple simultaneous cluster in time interval. When there are some clusters at the same time and these congestion clusters do not overlap with each other, the above procedure will have some clusters left.
Since the clusters don't share some common links, when one cluster process is carried out, the same process can applied to the remained network again and get another cluster. Repeat the process until there is no cluster.
Therefore the pseudo code for the multiple clustering problem can be formulated as in Fig 6: Multiple non-simultaneous cluster. Multiple non-simultaneous cluster means that these clusters do not take places at the same pace. For example some cluster may be at the time interval of 08:00~09:00 and others may be 08:30~09:15. In order to get this, the time horizon method is used. In the method, multiple Simultaneous Clustering process is used for one time domain, and the time domain rolls forward, and multiple Simultaneous Clustering process is applied again. Thus the jam clusters across the whole time horizon can be derived. Since the process is relatively simple, the pseudo code is omitted here.

Field Tests of Model
This section applies the method above to HangZhou city, China to identify the dynamic bottleneck and presents the evaluation results of the network.

Data Description
The data includes two parts: the geographical data set and the velocity data set. The geographical data is in a shape file which is developed and regulated by Esri and includes the detailed coordinates of each link. The adjacent relationship of all links also is covered. This data set helps to visualize the traffic state in the network. The other data is the average vehicle velocity data, which is the core of the paper. The raw data for calculating vehicle velocity is the taxi GPS data collected from 2012-11-01 to 2012-11-30 in city of Hangzhou, China. There are totally about 9000 taxis in the city at that time. The GPS equipment installed in each taxi sends second by second location and heading direction information to the traffic management center. The location can be matched to the geographical data to determine which link the taxi is driving on. In this way, the travel trajectory of each vehicle can be obtained. To calculate the velocity of each link, the entire modeling time horizon is divided into intervals of five-minutes. For a specific link i at time interval k, vehicles traversed the link is selected from raw data. Suppose that, for a vehicle selected within this time interval, the initial spatial-temporal coordinate is (t 1 ,x 1 ) and the final coordinate is (t 2 ,x 2 ), then the mean velocity within this spatial-temporal space can be calculated as ðx 1 ;x 2 Þ ðt 1 ;t 2 Þ . These mean velocities are then averaged over all the vehicles to obtain the link velocity. For each link, there will be totally 288 successive velocities within an analysis period of 24 hours. The directional velocity is not considered. But for the description of the Spatial-Temporal Congestion Identification whole network, such simplification is acceptable. A two-way street is modeled as two links here, and the velocity is calculated separately for each link.
The studied network covers all the links within the "Ring Highway" of Hangzhou City, as shown in Since the majority of the roads velocity is available, and the number of links is relatively low in the early morning, and we mainly focus on the peak hours rather than off-peak hours, such coverage is acceptable.
Due to the huge computation issue, just part of the whole network is used. The sub-network is 5km Ã 5km square area, as shown in Fig 9. In this area there are 2654 links, totally 5008 pair of adjacent links. For each link we get the velocity of one day and calculate the time rolling horizon similarity and cluster the jam areas based on the similarity measurement. The second order similarity is calculated using cosine similarity method. There data limit is based on velocity limit for each road rank. The ranks include principle arterial road, minor arterial road, principle branch, expressway and highway. The speed limits are as follows: principle arterial road limit is 50km/h; minor arterial road limit is 50km/h; principle branch road velocity limit is 40km/h; expressway limit is 80km/h; highway limit is 120km/h.

Preliminary Results
We set the time domain length to 2 hours, which means the similarity is carried out with two hours. The rolling time step is either 2 hour. Therefore for a whole day, the clustering process is carried out 12 times. The congestion clusters within each time domain are generated. it can be observed that major clusters distribute along eastern-western and southern-northern arterials, typically the skyway, on-off ramps, intersections at major arterial roads. The covered links number occupies about 16% of the whole network. However, it is not the case for 02:00~04:00, when the morning traffic peak is not approaching yet. Fig 11 gives the results. It is shown that just the major skyway across the whole area and some major intersections is identified as congestion clusters. This is due to the fact that before the peak hour, some areas have relatively higher demand than other areas, typically the residential communities rather than the major roads. Due to the fact that most commuters choose the skyway as a route, it remains higher similarity even at this time domain. This can also be observed from Fig  12, where the number of clusters are presented along the time domains. At the first time domain, i.e. 00:00~02:00,due to the large scale missing data, many neighboring links both with many missing data points have a higher similarity, thus the number of links covered by the clusters are maximal. But it does not mean that the congestion is severe at that moment.
We change the time domain length to 1.5 hour and the rolling time step to 1.5 either to check the clustering result. Fig 13 gives the results for time from 07:30 to 09:00. All the clusters occupy 22% of the network, which is more than that of the same time period when the time domain length is 2 hours. Also the major arterial roads intersection and the skyway are identified as congestion clusters.
For each time domain, many clusters of congestion links are identified. Some links may be identified as jam cluster in some domains while do not exist in other domains. Some links are always recognized as jam clusters. Thus the common links can be viewed as recurrent bottlenecks. Fig 14 displays the common links number along time. For the whole day, there is no common links that exist at all time domains. It indicates that there are no areas or links that are always congested. The trend of the common links number is very alike to Fig 12. In the deep night, due to the large scale loss of data, there are many links share higher similarities. When network traffic demands increase, the dis-similarity between areas makes it difficult to find a cluster. During the day time, the common links number firstly increases and then decreases.  Spatial-Temporal Congestion Identification the similarity across the whole network. The threshold can be given in two ways: either by giving a scalar or by the percentile of all similarities. Here the latter method is used. A percentile of all the similarities is set as a threshold and then the clustering process is implemented. When Spatial-Temporal Congestion Identification the percentile is 100%, then the resulting cluster would be zero since no similarity exceeds this threshold. When the threshold is set at 0%, the resulting cluster will be the whole network. Fig  15 presents the results applying different clustering threshold from the time domain 07:00 to 08:00. The proportion of links is calculated based on the target network, which contains 2654 links.
It seems that the size of clusters changes linearly with the percentile threshold. 90% percentile cluster occupy about 30% of the whole network, which implies that the network operates at a higher simultaneous manner.

Conclusion
Traffic congestion on the road network involves both spatial and temporal variation. Information of when and where there is congestion will greatly benefit the following traffic flow operational state improvement. The task sometimes is difficult due to the missing data problem. This research deals with this problem using two stage measuring process. In the first stage the so called first order similarity is calculated given that the traffic flow variable is bounded at both sides. And the first order similarity is used to generate the second order similarity as the output. The spatial-temporal congestion is clustered in a heuristic manner based on the second-order similarity results.
Since the traffic flow data are collected network wide. If the network topology is considered, the accuracy of spatial-temporal congestion identification may be improved. For example, the shortest path, the feasible path set can be incorporated. And the similarity measurement may between unequal-length traffic flow data series. Another more important task is how to evaluate the congestion. Some indexes are possibly proposed to quantify the congestion, i.e. congestion quantum which is calculated by dividing congestion area by spatial-temporal resource.

Spatial-Temporal Congestion Identification
Needless to say, such tasks are difficult. When suitable tools such as dynamic traffic assignment is combined, the interpretation and the further application of the method proposed is possible to be improved to achieve such tasks. These tasks will be next stage work.
Supporting Information S1 File. Original map data and velocity data in the paper. (ZIP)