A Hidden Markov Model for Urban-Scale Traffic Estimation Using Floating Car Data

Urban-scale traffic monitoring plays a vital role in reducing traffic congestion. Owing to its low cost and wide coverage, floating car data (FCD) serves as a novel approach to collecting traffic data. However, sparse probe data represents the vast majority of the data available on arterial roads in most urban environments. In order to overcome the problem of data sparseness, this paper proposes a hidden Markov model (HMM)-based traffic estimation model, in which the traffic condition on a road segment is considered as a hidden state that can be estimated according to the conditions of road segments having similar traffic characteristics. An algorithm based on clustering and pattern mining rather than on adjacency relationships is proposed to find clusters with road segments having similar traffic characteristics. A multi-clustering strategy is adopted to achieve a trade-off between clustering accuracy and coverage. Finally, the proposed model is designed and implemented on the basis of a real-time algorithm. Results of experiments based on real FCD confirm the applicability, accuracy, and efficiency of the model. In addition, the results indicate that the model is practicable for traffic estimation on urban arterials and works well even when more than 70% of the probe data are missing.


Introduction
Traffic congestion has become a severe problem in metropolises, resulting in widespread wastage of time and energy [1]. Traffic monitoring and estimation is an important method for obtaining information on traffic conditions; thus, it plays a vital role in reducing traffic congestion [2]. Static sensors (inductive loop detectors [3], video cameras [4], etc.), deployed at fixed locations on roads, are used to detect traffic state (e.g., flow velocity and traffic density). However, it is difficult for these traditional approaches to cover all roads because they involve extensive infrastructure deployment and high maintenance costs [5].
With the rapid development of mobile technologies, recent years have witnessed the emergence of a new method known as floating car data (FCD) for collecting valuable real-time information on traffic conditions. In this method, vehicles (e.g. taxis and buses) equipped with global positioning systems (GPS), accelerometers, and other sensors can provide data such as position, velocity, and acceleration of the vehicle. FCD is not as expensive as traditional data acquisition methods because it requires no dedicated infrastructure. Moreover, it has the potential to provide good spatiotemporal coverage of the transportation network and useful data given a certain penetration rate in the population [6,7].
In some previous studies, FCD has been used to estimate traffic conditions on highways, providing good results with a low penetration rate (1%-3%) [7][8][9]. Compared to traffic estimation on freeways, traffic estimation on arterials is more complex because of the traffic lights and intersections, and it requires a greater number of samples for analysis. Several researches have discussed the minimum penetration rate required. For example, Breitenberger et al. [10] proposed a penetration rate of 10% on arterial and urban roads. In addition, Vandenberghe et al. [11] discussed the maximum sample interval and the maximum transmission interval of aggregated samples, where the former defines the time between two consecutive FCD samples captured by the same floating car and the latter defines the time between two consecutive server uploads of all new samples by a floating car.
However, it is difficult for FCD to meet the sampling requirements in practice, and the distribution of observed probe data may be sparse and uneven. Traffic state estimation using sparse probe data has not been explored extensively. Herring et al. [12] proposed a probabilistic modeling framework for estimating arterial travel time distribution using sparse probe data. They modeled the evolution of traffic states as a coupled hidden Markov model (HMM), in which the traffic states of nearby road segments are correlated and evolve over time in a Markov manner. The present study differs from their study in that it considers links with similar traffic conditions instead of adjacent links of the road network, which may improve the modeling accuracy. Yanmin et al. [13] revealed the hidden structures within the traffic conditions of a road network using principal component analysis (PCA) and proposed a compressive sensing-based algorithm for obtaining the missing traffic conditions. However, they simply developed an offline data analytics algorithm that cannot be applied to real-time traffic estimation.
The present study proposes an HMM-based model that focuses on overcoming the problem of data sparseness for traffic estimation using FCD. It is assumed that the traffic state of a road segment is invisible and that each road segment belongs to a cluster of road segments having similar traffic characteristics. The traffic conditions of the other road segments in the cluster are considered as observations, based on which an HMM can be constructed. An algorithm based on clustering and pattern mining is proposed to find all road segment clusters in which segments have similar traffic characteristics, and a multi-clustering strategy is adopted to achieve a trade-off between clustering accuracy and coverage. Through data analysis, two exponential distribution functions are used for computing emission probability and transition probability. Finally, a real-time estimation algorithm is developed for online traffic application. The results of extensive experiments conducted using real floating car data show that our model works well even when more than 70% of the probe data are missing.
The remainder of this paper is organized as follows. Section 2 describes the problem of traffic estimation using sparse probe data. Section 3 discusses the construction of an HMM-based traffic estimation model, outlines the main steps of the proposed approach, and presents an algorithm for real-time traffic estimation. Section 4 describes the implementation of the proposed model and as well as a case study for assessing the accuracy of the model. Finally, Section 5 summarizes our findings and concludes the paper. the vehicle. A road network G is divided into a set of road segments, R = {r n |n = 1, 2, . . ., N}, by intersections. A map-matching algorithm is used to find the road segment on which the vehicle is traveling at time t. In order to facilitate statistical analysis, a set of predefined time slots, T = {t m |m = 1, 2, . . ., M}, instead of continuous time, is employed for traffic condition estimation. Then, the state of vehicle s is converted into s<id, r, v, t>.
In the field of traffic engineering, several metrics have been proposed for quantifying the traffic condition of a link, such as speed [14], density [15], flow [16], and queues at intersections [17]. Furthermore, Many traffic flow models [18][19][20][21][22][23][24][25][26][27][28][29] have been proposed to study complex traffic conditions. The present study employs the velocity of the traffic flow on a road segment, as in some previous studies [1,13,14,30,31]. The floating cars are part of the traffic flow; hence, it is reasonable to consider their speed as the speed of the traffic flow. The speed of the traffic flow on segment n at time slot m, x n,m , is approximated as the average speed of all vehicles moving within the traffic flow on this road segment at time slot m. Then, the traffic condition of the road network can be expressed by matrix X as follows: x r;t . . .
Here, the row X r = {x r,m |m = 1, 2, . . ., M} represents the traffic condition sequence of road segment r over time. Because of the randomness and unevenness of the floating car data, it is difficult to obtain a complete traffic condition matrix, as there are many spatiotemporal vacancies with no probe measurements. As shown in Fig 1, for the traffic condition sequence of a road segment, there may be some sub-sequences without sample data. Hence, in this study, the main objective of traffic estimation is to estimate the values of these missing states, which can approximate the true states.

Estimation model based on HMM
In this paper, an HMM-based estimation model is proposed to estimate missing traffic state sequences. An HHM, which is based on the concept of Markov process and Markov chain, is characterized by five elements: observation, hidden state, state transition probability, emission probability, and initial state. The first step in HMM construction is to establish the observations and hidden states of the model. In this study, the traffic condition of the target road segment r at time slot t is the hidden state x r,t . It is assumed that the road segment r belongs to a cluster C in which all road segments have similar traffic characteristics. Then, the observations y r,t are defined as the traffic conditions of the other road segments in the cluster. In some previous studies [16,32], adjacent road segments have been assumed to be correlated with each other. However, in practice, such an assumption may be not very accurate. A method based on clustering and frequent pattern mining is proposed in Section 3.2 in order to find clusters having road segments with similar traffic characteristics. In this study, speed, which is a continuous variable, is employed as the traffic condition; thus, the hidden state has an infinite number of values. Therefore, it is necessary to select finite candidate states for the HMM process. The state value range at t can be limited according to the observation y r,t and previous state x r,t-1 ; then, the range should be discretized to a candidate state set CS r,t = {x i r,t |i = 1, 2, . . ., k}.
The emission probability, Pr(y r,t |x r,t ), is the likelihood of observing the traffic condition y r,t conditional on the traffic condition x r,t being the true condition of the road segment r at time slot t. The transition probability, Pr(x r,t , x r,t+1 ), is the probability that the traffic condition of the road segment r will transform from a state x r,t at time t to another state x r,t+1 at time t+1. The methods for measuring and calculating the emission probability and transition probability are discussed in Section 3.3.
The HMM sequentially generates candidate traffic condition sequences and evaluates them on the basis of their likelihood, which is measured by the joint probability (Fig 2). Past hypotheses of the solution are extended to account for new observations over time. Then, the surviving sequence with the highest joint probability is selected from among the remaining candidates of the previous stage as the final solution. The joint probability is expressed as

Traffic similarity analysis
Clustering analysis. In this study, the traffic condition sequence, X r = {x r,t |t = 1, 2, . . ., M}, is considered as the traffic characteristic of the road segment r over a period of time. A spectral clustering algorithm is adopted to divide the road segment set into clusters based on historical data, and the road segments in the same cluster have similar traffic characteristics.
In the spectral clustering algorithm, the set of points in an arbitrary feature space can be represented as a complete weighted undirected graph G(V, E). The vertices of the graph G are the points in the feature space and the weight w ij of an edge (v i , v j ) in E is a measure of the similarity between vertex v i and v j . In this context, we can formulate the clustering problem as a graph-partitioning problem that requires partitions V 1 , V 2 . . ., V k of the vertex set V according to some measure; then, the vertices in any set V i have a high degree of similarity, and the vertices in two different sets V i , V j have a low degree of similarity.
For road segment clustering, the road segments are considered as vertices of the graph G. The weight w ij between the road segments i and j is the difference between the traffic characteristics of the two road segments; it can be expressed by a Euclidean distance as follows: where M is the number of time slots in the traffic condition sequence and x i,t is the traffic condition of road segment i at time slot t. A normalized spectral clustering algorithm (Box 1) is constructed according to previous research [33]: In practice, it is difficult to determine a suitable number of clusters for road segment clustering. Therefore, a modified clustering algorithm is proposed, and the average weight w av of a cluster, instead of the cluster number k, is set as a constraint for controlling the clustering process. In order to simplify the computation, w av is defined as the average weight between the centroid of a cluster and other objects. The centroid is given by where V k denotes the vertices of the k-th cluster, |V k | is the number of vertices in V k , v i is the ith vertex of V k , and vc k is the centroid of V k ,. Then, the average weight of V k can be expressed Box 1. Spectral clustering algorithm. 3: L I À D À 1 2 Á W Á D À 1 2 ; //Compute normalized Laplacian 3: Compute the first k eigenvectors u 1 ,. . . u k of L; 4: Let U be the matrix containing the vectors u 1 ,. . ., u k as columns; 5: Use the k-means algorithm to cluster U, then get the clusters C; 6: return C; In the algorithm (Box 2) based on the constraint ω, the vertices of G are divided into small clusters step by step, and the cluster whose w av is greater than the threshold value ω should be divided into smaller clusters until the constraint is met. Because clusters with only one object are meaningless, it is reasonable to set a minimum number of objects in a cluster (N min ). A small value k is set as the number of clusters in every clustering step. In this study, both k and N min are set as 2.
Pattern mining. To avoid coincidental clusters, it is reasonable to perform clustering multiple times for different days and find frequent clusters with a better representation of traffic similarity between road segments. In general, the traffic condition exhibits different patterns on weekdays and weekends. As shown in Fig 3, the traffic conditions of arterial roads in Beijing have similar characteristics on weekdays but significantly different characteristics on weekends. Therefore, the traffic conditions should be discussed separately.
The road segment set, R = {r n |n = 1, 2, . . ., N}, is divided into cluster set C d = {R k |k = 1, 2, . . ., K} according to the traffic condition of the d-th day using the clustering algorithm proposed in Section 3.2.1. The cluster set list, L = {C d |d = 1, 2, . . ., D}, contains all clusters of the last D days (weekdays or weekends) from the target day. The objective of the frequent pattern mining approach adopted in this study is to find the frequent cluster set, P = {R j &R |j = 1, 2, . . ., J}, where the cluster R j appears frequently in L. An indicator function is used to indicate whether R j appears in the cluster set C d : for i 1 to k do 5: if the average weight w av of V i is greater than ω 6: Get sub-graph G i from G corresponding to V i ; 7: In this study, the number of times the cluster R j appears in the cluster set L is defined as the support, and it can be calculated as follows: The frequent cluster must meet the minimum support, Sup min ; then, the frequent cluster set can be defined as follows: . . . J; and supportðR j ; LÞ ! Sup min g ð 8Þ It is difficult for traditional pattern mining algorithms (e.g., the Apriori algorithm) to compute and find the frequent clusters, as the number of road segments and clusters is extremely large. To overcome this problem, a frequent pattern mining approach based on intersection is proposed. The intersection between two cluster sets is expressed as where C 1 and C 2 are the cluster sets of two days, R i and R j are arbitrary clusters of sets C 1 and C 2 , respectively, and R k is the intersection of R i and R j . Note that R k is meaningful for estimation only if it includes more than one road segment. Therefore, in the intersection process, the cluster R k will be discarded when | R k |<2. As shown in Fig 4, the frequent cluster set can be obtained by gradual intersection. For a cluster set list L i , which has i cluster sets, the frequent clusters that appear i times in L i can be found by a recursive algorithm based on intersection; this algorithm can be expressed as where C j is an arbitrary cluster set in the list L i ; if L i contains only one cluster set C j , then Pattern(L i ) will return C j . In order to find all frequent clusters that appear i times in L, it is necessary to obtain the combination set, given by where L i com includes all combinations that contain i cluster sets of L; the number of combinations is jLj i À Á . Then, the frequent cluster set can be obtained as where FC i contains all frequent clusters whose support is i. Then, all frequent clusters that appear more than Sup min times can be obtained using the following algorithm (Box 3): Multi-clustering strategy. As discussed in Section 3.2.1, the smaller the value of the constraint ω, the more similar are the traffic characteristics of the road segments in the same cluster; this may improve the estimation accuracy. However, the frequent cluster set covers fewer road segments because of the more stringent constraint. To resolve this conflict, a multipleclustering strategy is adopted. Multiple constraints ω 1 , ω 2 ,. . .ω m are selected for clustering and pattern mining; then, a list of frequent cluster sets, FCL = {FC ω |ω = ω 1 , ω 2 , . . ., ω m }, is generated, where FC ω is the frequent cluster set corresponding to the constraint ω. In the process of finding the cluster that contains the road segment r, the frequent cluster set with smaller ω should be considered first.

Probability calculation
Emission probability. For a road segment r that belongs to the frequent cluster C, its traffic condition x r,t at time slot t approximates the traffic conditions y r,t of other road segments in C. Thus, the difference diff between x r,t and y r,t can be adopted to calculate the emission probability Pr(y r,t |x r,t ). According to observations, the emission probability follows an exponential distribution; hence, it can be calculated as Pr y r;t jx r;t ð Þ¼le ÀlÁdiff ¼ le where, λ is the parameter of the exponential distribution (λ>0) anddiff is the average traffic condition difference between the road segment r and road segments in C' = {r'2C| r'6 ¼r}.
In order to find the appropriate frequent cluster C, the frequent cluster set FC ω with a smaller constraint value ω should be considered preferentially, and in the frequent cluster set FC = {FC i |i = Sup min , Sup min +1, . . ., D}, the clusters that appear more frequently should be considered first.
Transition probability. Through data analysis, in a relatively short time interval, the traffic condition at time slot t+1 is close to that at time slot t. Thus, the traffic state change Δx = |x r,t -x r,t+1 | is employed to measure the state transition. According to observations, the state transition probability follows an exponential distribution, and it can be expressed as where, β is the parameter of the exponential distribution (β>0).

Candidate selection
As discussed in Sections 3.2 and 3.3, the state x r,t+1 may approximate the previous state x r,t and the observations y r,t+1 = {x i,t+1 |i2C and i6 ¼r}, where C is the frequent cluster that contains the road segment r. Then, the value range of x r,t+1 is set as [x min -μ, x max +μ], where x min = min({x r,t }[y r,t+1 ), x max = max({x r,t }[y r,t+1 ), and μ is used to avoid missing valid values. For computational convenience, the range should be discretized to finite candidates denoted by the set where N cand is the number of candidates. In order to facilitate algorithm design, the candidate set is CS = {x r,t+1 } when the state at time slot t+1 is obtained from the samples. Then, it is not necessary to find the missing state sub-sequences discussed in Section 2; the entire state sequence can be estimated in a single process.

Real-time algorithm
For the road segment r at time slot t, a list PreSList = {Se i |i = 1,2,. . .m} is used to store previous surviving state sequences. A sequence is denoted by Se = (SS, JP), where SS = {x r,1 ,x r,2 ,. . .,x r,t-1 } stores previous consecutive candidate states and JP is the joint probability. Using Algorithm 4 (Box 4), the current candidate sequence list SList is obtained according to PreSList and the candidate states CS t of road segment r at time slot t. Then, the state sequence with the maximum joint probability in SList is the optimal solution of road segment r at time slot t; the sequence is given by argmax Se2SList {Se.JP}. For the first state, the initial joint probability of the state sequence is the emission probability of the candidate states. Obviously, the algorithm can output the estimated states in real time; thus, it is applicable to online application.

Results and Discussion
For the experiments, 8559 arterial road segments were selected; the roads cover the main regions of central Beijing. The traffic conditions between 6:00 and 24:00 were considered, and the time was divided into 108 time slots at 10-min intervals (e.g., the first time slot was 6:00-6:10 and the 12 th time slot was 7:50-8:00).
The taxi trajectory data in Beijing during November 2012 served as the FCD data, obtained from 12,600 taxis. The data samples of six weekdays were selected for a case study; five of these days were used for frequent cluster mining and parameter estimation, and the remaining day was used to test the estimation model. Before the experiments were performed, the trajectory data were matched to the road network using map-matching methods [34][35][36], and anomalous samples were eliminated.
The model was implemented using a Java platform on a computer having a quad-core CPU (2.2 GHz) and 8-GB memory.

Frequent cluster mining
Six constraint values {ω |ω = 10, 15, 20, 25, 30, 35} were considered in the clustering analysis stage. The average weight w av discussed in Section 3.2.1 was employed to measure the degree of similarity, which decreased as w av increased. As shown in Table 1, the mean w av of the cluster set and the average number of objects in each cluster, ON average , increase with ω. Clusters having a single object cannot be used for estimation; the proportion of such clusters, r single , decreases as ω increases. For traffic estimation, a perfect cluster set has small average w av , large ON average , and small r single . A cluster set having small average w av is more likely to have small ON average and large r single , which confirms the existence of the contradiction discussed in Section 3.2.3. Therefore, it is necessary to adopt a multi-clustering strategy.
As shown in Fig 5, the traffic characteristics of the road segments in the same cluster have a very high degree of similarity when the average weight w av is small, such as clusters a, b, and c. As the average weight w av increases, the degree of similarity of the cluster decreases and the number of the objects in the cluster increases.
Samples of five days were selected for frequent cluster mining, and the minimum support Sup min was set as 3. Table 2 lists the coverage rates of the frequent clusters, which is given by the ratio R cover = N cover /N total , where N cover is the number of road segments in the frequent cluster set and N total is the total number of road segments. The coverage rate increases with ω, and the support of the most frequent clusters is less than or equal to 4. When ω increases to 35, the coverage rate of the frequent cluster set reaches 96.76%, which indicates that the set of these ω is sufficient and appropriate for this study.
The road segments that are adjacent to each other may have similar traffic characteristics; this property can be used instead of clustering for finding similar road segments. However, in contrast to our assumption, this is not very likely in practice. As shown in Fig 6, although the proportion of road segments whose adjacent segments are in the same cluster increases with ω, this proportion is still low. Therefore, it is more reasonable to find similar road segments by clustering rather than by adjacency relationships.

Parameter estimation
Statistical analysis of the distribution of diff was carried out in order to estimate the parameter λ in (13). In different frequent cluster sets corresponding to specific values of ω, the distribution of diff is different. In order to observe the distribution of diff, we calculated the ratio of each diff value to the total number of samples. As shown in Fig 7, the steepness of the distribution curve increases with ω, which indicates that the road segments in the frequent cluster  generated on the basis of a smaller ω are more likely to have a higher degree of similarity, because the probability that diff takes a smaller value is higher. The parameter λ was calculated as 1/E(diff), where E(diff) is the expectation of diff, and the equation was initialized with an initial parameter λ Ã ; then, the parameter was learned by iterative computation until it converged to a specific value. Table 3

Model accuracy and efficiency
A state sequence set of 8559 arterial road segments was prepared for testing. Because it is difficult to obtain the complete state set of these road segments, the states that were actually obtained were considered for accuracy analysis; the total number of these states, N base , was 6.19×10 5 . Among these states, a number of states were randomly selected as the missing states that need to be estimated. The missing state rate is denoted by R miss = N miss /N base , where N miss is the number of missing states. The mean absolute error (MAE) was employed to measure the estimation accuracy, and it is given by where N estim is the number of states estimated,x i is the estimated value of the i-th state, and x i is the true value of the i-th state. The accuracies of two models, Model 1 and Model 2, were compared. Model 1 is the proposed model, which finds road segments with similar traffic states via clustering and frequent pattern mining, whereas Model 2 assumes that adjacent road segments have similar traffic conditions. As shown in Fig 9, MAE increases with R miss , and the MAE of Model 2 is significantly higher than that of Model 1, which implies that the proposed model is more accurate.
If there is no reference state, such as the previous state or the states of similar road segments, for the state x r,t , the state cannot be estimated. The rate of the states that cannot be estimated is R miss -R estim ; then, the number of states that can be estimated is R valid = 1-(R miss -R estim ), where R estim = N estim /N base . As shown in Table 4, MAE increases gradually before R miss reaches 83.33%, and R valid remains high until R miss reaches 92.59%, which indicates that the model is applicable to very sparse sample data.
The cumulative distribution function (CDF) of the estimation error E is given by where estimation error E is the absolute value of the difference between the estimated and observed values, N estim is the number of estimated states, and N(E e) is the number of estimated states whose error is less than or equal to e. Fig 10 shows the CDFs of estimation errors corresponding to different values of the missing state rate R miss . Before R miss reaches 83.33%, the CDF curve is steeper, which indicates that most errors are small. For example, when R miss = 83.33%, more than 52.79% of the errors are less than or equal to 5 and more than 76.88% of the errors are less than or equal to 10. The states that are obtained are considered as the true states; then, the estimated error distribution function in the global scope is given by Table 5 summarizes the global distribution of the estimated error, which reflects the accuracy of the model corresponding to different values of the missing state rate R miss . According to the error distribution, it is easy to determine whether the accuracy of the model meets the requirements of the application. For example, in an application that requires 90% of the errors are to be less than 5 km/h, if the missing state rate is less than 18.52%, then the model may be suitable for the application.
In our data source, the missing state rate of the arterial roads was around 33%, while the missing state rate of the other roads was around 65%. Samples of 8559 arterial roads were considered. The error was less than or equal to 5 (resp. 10) for more than 84.84% (resp. 92.61%) of the states; this indicates a high estimation accuracy. Fig 11(A) shows the traffic state map of the arterial roads in Beijing at the 50 th time slot (14:10-14:20) before estimation, and Fig 11(B) shows the states of the roads after estimation. Most of the missing states were estimated, and the estimated values were very close to the true values.
The main factor that affects the efficiency of the model is the number of candidates for the hidden states, N cand , which has been discussed in Section 3.4. Several values of N cand were selected for the experiments, where the missing state rate was around 74%. The results show that the accuracy improved as N cand increased; however, the time cost increased significantly

Conclusion
This paper presented an effective and efficient HMM-based model for urban-scale traffic estimation using floating car data. Clustering analysis and pattern mining were adopted to analyze a large data set of real probe data collected from a fleet of 12,600 taxis in Beijing, China, and it was found that there exist frequent clusters in which the road segments have similar traffic characteristics. Comparative analysis showed that the model based on clustering is more effective than the model based on adjacency relationships for traffic estimation. In order to achieve a trade-off between clustering accuracy and coverage, a multi-clustering strategy was adopted  in the estimation process. Experimental results showed that the model can be applied to different scenarios; even when more than 70% of the original data are missing, the model can guarantee that more than 80% of the states have relatively small errors. In addition, the model was implemented using a real-time algorithm, which offers higher precision and has a broader scope for application than some offline traffic estimation algorithms.