## Figures

## Abstract

Urban-scale traffic monitoring plays a vital role in reducing traffic congestion. Owing to its low cost and wide coverage, floating car data (FCD) serves as a novel approach to collecting traffic data. However, sparse probe data represents the vast majority of the data available on arterial roads in most urban environments. In order to overcome the problem of data sparseness, this paper proposes a hidden Markov model (HMM)-based traffic estimation model, in which the traffic condition on a road segment is considered as a hidden state that can be estimated according to the conditions of road segments having similar traffic characteristics. An algorithm based on clustering and pattern mining rather than on adjacency relationships is proposed to find clusters with road segments having similar traffic characteristics. A multi-clustering strategy is adopted to achieve a trade-off between clustering accuracy and coverage. Finally, the proposed model is designed and implemented on the basis of a real-time algorithm. Results of experiments based on real FCD confirm the applicability, accuracy, and efficiency of the model. In addition, the results indicate that the model is practicable for traffic estimation on urban arterials and works well even when more than 70% of the probe data are missing.

**Citation: **Wang X, Peng L, Chi T, Li M, Yao X, Shao J (2015) A Hidden Markov Model for Urban-Scale Traffic Estimation Using Floating Car Data. PLoS ONE 10(12):
e0145348.
doi:10.1371/journal.pone.0145348

**Editor: **Tieqiao Tang, Beihang University, CHINA

**Received: **August 25, 2015; **Accepted: **December 2, 2015; **Published: ** December 28, 2015

**Copyright: ** © 2015 Wang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

**Data Availability: **Data have been deposited to Figshare: http://dx.doi.org/10.6084/m9.figshare.1618856.

**Funding: **TC is supported by a National Key Technology Support Program (2015BAJ02B00) and Ministry of Science and Technology Policy Guidance Project (2011FU125Z24). URL: http://program.most.gov.cn/. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests: ** The authors have declared that no competing interests exist.

## Introduction

Traffic congestion has become a severe problem in metropolises, resulting in widespread wastage of time and energy [1]. Traffic monitoring and estimation is an important method for obtaining information on traffic conditions; thus, it plays a vital role in reducing traffic congestion [2]. Static sensors (inductive loop detectors [3], video cameras [4], etc.), deployed at fixed locations on roads, are used to detect traffic state (e.g., flow velocity and traffic density). However, it is difficult for these traditional approaches to cover all roads because they involve extensive infrastructure deployment and high maintenance costs [5].

With the rapid development of mobile technologies, recent years have witnessed the emergence of a new method known as floating car data (FCD) for collecting valuable real-time information on traffic conditions. In this method, vehicles (e.g. taxis and buses) equipped with global positioning systems (GPS), accelerometers, and other sensors can provide data such as position, velocity, and acceleration of the vehicle. FCD is not as expensive as traditional data acquisition methods because it requires no dedicated infrastructure. Moreover, it has the potential to provide good spatiotemporal coverage of the transportation network and useful data given a certain penetration rate in the population [6, 7].

In some previous studies, FCD has been used to estimate traffic conditions on highways, providing good results with a low penetration rate (1%–3%) [7–9]. Compared to traffic estimation on freeways, traffic estimation on arterials is more complex because of the traffic lights and intersections, and it requires a greater number of samples for analysis. Several researches have discussed the minimum penetration rate required. For example, Breitenberger et al. [10] proposed a penetration rate of 10% on arterial and urban roads. In addition, Vandenberghe et al. [11] discussed the maximum sample interval and the maximum transmission interval of aggregated samples, where the former defines the time between two consecutive FCD samples captured by the same floating car and the latter defines the time between two consecutive server uploads of all new samples by a floating car.

However, it is difficult for FCD to meet the sampling requirements in practice, and the distribution of observed probe data may be sparse and uneven. Traffic state estimation using sparse probe data has not been explored extensively. Herring et al. [12] proposed a probabilistic modeling framework for estimating arterial travel time distribution using sparse probe data. They modeled the evolution of traffic states as a coupled hidden Markov model (HMM), in which the traffic states of nearby road segments are correlated and evolve over time in a Markov manner. The present study differs from their study in that it considers links with similar traffic conditions instead of adjacent links of the road network, which may improve the modeling accuracy. Yanmin et al. [13] revealed the hidden structures within the traffic conditions of a road network using principal component analysis (PCA) and proposed a compressive sensing-based algorithm for obtaining the missing traffic conditions. However, they simply developed an offline data analytics algorithm that cannot be applied to real-time traffic estimation.

The present study proposes an HMM-based model that focuses on overcoming the problem of data sparseness for traffic estimation using FCD. It is assumed that the traffic state of a road segment is invisible and that each road segment belongs to a cluster of road segments having similar traffic characteristics. The traffic conditions of the other road segments in the cluster are considered as observations, based on which an HMM can be constructed. An algorithm based on clustering and pattern mining is proposed to find all road segment clusters in which segments have similar traffic characteristics, and a multi-clustering strategy is adopted to achieve a trade-off between clustering accuracy and coverage. Through data analysis, two exponential distribution functions are used for computing emission probability and transition probability. Finally, a real-time estimation algorithm is developed for online traffic application. The results of extensive experiments conducted using real floating car data show that our model works well even when more than 70% of the probe data are missing.

The remainder of this paper is organized as follows. Section 2 describes the problem of traffic estimation using sparse probe data. Section 3 discusses the construction of an HMM-based traffic estimation model, outlines the main steps of the proposed approach, and presents an algorithm for real-time traffic estimation. Section 4 describes the implementation of the proposed model and as well as a case study for assessing the accuracy of the model. Finally, Section 5 summarizes our findings and concludes the paper.

## Problem Description

There are numerous floating cars running on the roads. They upload their state information, such as location, speed, and direction, from time to time. The state of a floating car at time *t* is expressed as *s<id*, *l*, *v*, *t>*, where *id*, *l*, and *v* denote the ID, location, and speed, respectively, of the vehicle. A road network *G* is divided into a set of road segments, *R = {r*_{n} *|n = 1*, *2*, *…*, *N}*, by intersections. A map-matching algorithm is used to find the road segment on which the vehicle is traveling at time *t*. In order to facilitate statistical analysis, a set of predefined time slots, *T = {t*_{m} *|m = 1*, *2*, *…*, *M}*, instead of continuous time, is employed for traffic condition estimation. Then, the state of vehicle *s* is converted into *s<id*, *r*, *v*, *t>*.

In the field of traffic engineering, several metrics have been proposed for quantifying the traffic condition of a link, such as speed [14], density [15], flow [16], and queues at intersections [17]. Furthermore, Many traffic flow models [18–29] have been proposed to study complex traffic conditions. The present study employs the velocity of the traffic flow on a road segment, as in some previous studies [1, 13, 14, 30, 31]. The floating cars are part of the traffic flow; hence, it is reasonable to consider their speed as the speed of the traffic flow. The speed of the traffic flow on segment *n* at time slot *m*, *x*^{n,m}, is approximated as the average speed of all vehicles moving within the traffic flow on this road segment at time slot *m*. Then, the traffic condition of the road network can be expressed by matrix *X* as follows:
(1)

Here, the row *X*_{r} *= {x*^{r,m} *|m = 1*, *2*, *…*, *M}* represents the traffic condition sequence of road segment *r* over time. Because of the randomness and unevenness of the floating car data, it is difficult to obtain a complete traffic condition matrix, as there are many spatiotemporal vacancies with no probe measurements. As shown in Fig 1, for the traffic condition sequence of a road segment, there may be some sub-sequences without sample data. Hence, in this study, the main objective of traffic estimation is to estimate the values of these missing states, which can approximate the true states.

The value 1 indicates the presence of sample data and 0 indicates the absence of sample data at a time slot.

## Methods

### Estimation model based on HMM

In this paper, an HMM-based estimation model is proposed to estimate missing traffic state sequences. An HHM, which is based on the concept of Markov process and Markov chain, is characterized by five elements: observation, hidden state, state transition probability, emission probability, and initial state.

The first step in HMM construction is to establish the observations and hidden states of the model. In this study, the traffic condition of the target road segment *r* at time slot *t* is the hidden state *x*^{r,t}. It is assumed that the road segment *r* belongs to a cluster *C* in which all road segments have similar traffic characteristics. Then, the observations *y*^{r,t} are defined as the traffic conditions of the other road segments in the cluster. In some previous studies [16, 32], adjacent road segments have been assumed to be correlated with each other. However, in practice, such an assumption may be not very accurate. A method based on clustering and frequent pattern mining is proposed in Section 3.2 in order to find clusters having road segments with similar traffic characteristics. In this study, speed, which is a continuous variable, is employed as the traffic condition; thus, the hidden state has an infinite number of values. Therefore, it is necessary to select finite candidate states for the HMM process. The state value range at *t* can be limited according to the observation *y*^{r,t} and previous state *x*^{r,t-1}; then, the range should be discretized to a candidate state set *CS*^{r,t} *= {x*_{i}^{r,t} *|i = 1*, *2*, *…*, *k}*.

The emission probability, *Pr(y*^{r,t}*|x*^{r,t}*)*, is the likelihood of observing the traffic condition *y*^{r,t} conditional on the traffic condition *x*^{r,t} being the true condition of the road segment *r* at time slot *t*. The transition probability, *Pr(x*^{r,t}, *x*^{r,t+1}*)*, is the probability that the traffic condition of the road segment *r* will transform from a state *x*^{r,t} at time *t* to another state *x*^{r,t+1} at time *t+1*. The methods for measuring and calculating the emission probability and transition probability are discussed in Section 3.3.

The HMM sequentially generates candidate traffic condition sequences and evaluates them on the basis of their likelihood, which is measured by the joint probability (Fig 2). Past hypotheses of the solution are extended to account for new observations over time. Then, the surviving sequence with the highest joint probability is selected from among the remaining candidates of the previous stage as the final solution. The joint probability is expressed as
(2)
where *J*^{r,1} *= Pr(y*^{r,1}*|x*^{r,1}*)* and *CS*^{r,t} denotes the set of candidate states of the road segment *r* at time slot *t*. After the HMM process, the last traffic condition sequence with the maximum joint probability can be found: . Then, the system works backwards to find the traffic condition sequence *x*_{T-1}, *…*, *x*_{1} of the road segment.

### Traffic similarity analysis

#### Clustering analysis.

In this study, the traffic condition sequence, *X*_{r} *= {x*^{r,t} *|t = 1*, *2*, *…*, *M}*, is considered as the traffic characteristic of the road segment *r* over a period of time. A spectral clustering algorithm is adopted to divide the road segment set into clusters based on historical data, and the road segments in the same cluster have similar traffic characteristics.

In the spectral clustering algorithm, the set of points in an arbitrary feature space can be represented as a complete weighted undirected graph *G(V*, *E)*. The vertices of the graph *G* are the points in the feature space and the weight *w*_{ij} of an edge *(v*_{i}, *v*_{j}*)* in *E* is a measure of the similarity between vertex *v*_{i} and *v*_{j}. In this context, we can formulate the clustering problem as a graph-partitioning problem that requires partitions *V*_{1}, *V*_{2}*…*, *V*_{k} of the vertex set *V* according to some measure; then, the vertices in any set *V*_{i} have a high degree of similarity, and the vertices in two different sets *V*_{i}, *V*_{j} have a low degree of similarity.

For road segment clustering, the road segments are considered as vertices of the graph *G*. The weight *w*_{ij} between the road segments *i* and *j* is the difference between the traffic characteristics of the two road segments; it can be expressed by a Euclidean distance as follows:
(3)
where *M* is the number of time slots in the traffic condition sequence and *x*^{i,t} is the traffic condition of road segment *i* at time slot *t*. A normalized spectral clustering algorithm (Box 1) is constructed according to previous research [33]:

### Box 1. Spectral clustering algorithm.

Algorithm 1 SpectralClustering: spectral clustering

**Input**: *G(V, E)*: Traffic condition graph; *k:* Number of clusters;

**Output**: *C = {V _{1},V_{2},…V_{k}}*: clusters;

1: Get weighted adjacency matrix *W* of *G(V, E)*;

2: Calculate degree matrix *D* of *W*;

3: ; //Compute normalized Laplacian

3: Compute the first *k* eigenvectors * _{1},… u_{k}* of

*L*;

4: Let *U* be the matrix containing the vectors *u _{1}, … , u_{k}* as columns;

5: Use the *k*-means algorithm to cluster *U*, then get the clusters *C*;

6: return *C*;

In practice, it is difficult to determine a suitable number of clusters for road segment clustering. Therefore, a modified clustering algorithm is proposed, and the average weight *w*_{av} of a cluster, instead of the cluster number *k*, is set as a constraint for controlling the clustering process. In order to simplify the computation, *w*_{av} is defined as the average weight between the centroid of a cluster and other objects. The centroid is given by
(4)
where *V*_{k} denotes the vertices of the *k*-th cluster, *|V*_{k}*|* is the number of vertices in *V*_{k}, *v*_{i} is the *i*-th vertex of *V*_{k}, and *vc*_{k} is the centroid of *V*_{k},. Then, the average weight of *V*_{k} can be expressed as
(5)

In the algorithm (Box 2) based on the constraint *ω*, the vertices of *G* are divided into small clusters step by step, and the cluster whose *w*_{av} is greater than the threshold value *ω* should be divided into smaller clusters until the constraint is met. Because clusters with only one object are meaningless, it is reasonable to set a minimum number of objects in a cluster (*N*_{min}). A small value *k* is set as the number of clusters in every clustering step. In this study, both *k* and *N*_{min} are set as 2.

### Box 2. Clustering algorithm based on constraint.

Algorithm 2 ConstraintClustering: Clustering based on constraint

**Input:** *G(V*, *E)*: Traffic condition graph; *ω*: Threshold value; *k*: Number of clusters at every step; *N*_{min}: Minimum number of objects in a cluster;

*Output*:*C = {V*_{1},*V*_{2},*…V*_{K}*}*: *clusters;*

1: **if** *|V|>N*_{min} and *|V|>k*

2: *C ←* SpectralClustering(*G*, *k*);

3: *C*_{temp} *← ∅*;

4: **for** *i ← 1* to *k* do

5: **if** the average weight *w*_{av} of *V*_{i} is greater than *ω*

6: Get sub-graph *G*_{i} from *G* corresponding to *V*_{i};

7: *C*_{temp} *← C*_{temp} ∪ SpectralClustering(*G*_{i}, *k*);

8: **else**

9: *C*_{temp} *← C*_{temp} ∪ *{V*_{i}*}*;

10: **end if**

11: **end for**

12: *C ← C*_{temp};

13: **else**

14: *C ← {V}*;

15: **end if**

16: return *C*;

#### Pattern mining.

To avoid coincidental clusters, it is reasonable to perform clustering multiple times for different days and find frequent clusters with a better representation of traffic similarity between road segments. In general, the traffic condition exhibits different patterns on weekdays and weekends. As shown in Fig 3, the traffic conditions of arterial roads in Beijing have similar characteristics on weekdays but significantly different characteristics on weekends. Therefore, the traffic conditions should be discussed separately.

The road segment set, *R = {r*_{n} *|n = 1*, *2*, *…*, *N}*, is divided into cluster set *C*_{d} *= {R*_{k} *|k = 1*, *2*, *…*, *K}* according to the traffic condition of the *d*-th day using the clustering algorithm proposed in Section 3.2.1. The cluster set list, *L = {C*_{d} *|d = 1*, *2*, *…*, *D}*, contains all clusters of the last *D* days (weekdays or weekends) from the target day. The objective of the frequent pattern mining approach adopted in this study is to find the frequent cluster set, *P = {R*_{j}*⊂R |j = 1*, *2*, *…*, *J}*, where the cluster *R*_{j} appears frequently in *L*. An indicator function is used to indicate whether *R*_{j} appears in the cluster set *C*_{d}:
(6)

In this study, the number of times the cluster *R*_{j} appears in the cluster set *L* is defined as the support, and it can be calculated as follows:
(7)

The frequent cluster must meet the minimum support, *Sup*_{min}; then, the frequent cluster set can be defined as follows:
(8)

It is difficult for traditional pattern mining algorithms (e.g., the Apriori algorithm) to compute and find the frequent clusters, as the number of road segments and clusters is extremely large. To overcome this problem, a frequent pattern mining approach based on intersection is proposed. The intersection between two cluster sets is expressed as
(9)
where *C*_{1} and *C*_{2} are the cluster sets of two days, *R*_{i} and *R*_{j} are arbitrary clusters of sets *C*_{1} and *C*_{2}, respectively, and *R*_{k} is the intersection of *R*_{i} and *R*_{j}. Note that *R*_{k} is meaningful for estimation only if it includes more than one road segment. Therefore, in the intersection process, the cluster *R*_{k} will be discarded when *| R*_{k} *|<2*. As shown in Fig 4, the frequent cluster set can be obtained by gradual intersection.

For a cluster set list *L*_{i}, which has *i* cluster sets, the frequent clusters that appear *i* times in *L*_{i} can be found by a recursive algorithm based on intersection; this algorithm can be expressed as
(10)
where *C*_{j} is an arbitrary cluster set in the list *L*_{i}; if *L*_{i} contains only one cluster set *C*_{j}, then *Pattern(L*_{i}*)* will return *C*_{j}. In order to find all frequent clusters that appear *i* times in *L*, it is necessary to obtain the combination set, given by
(11)
where includes all combinations that contain *i* cluster sets of *L*; the number of combinations is . Then, the frequent cluster set can be obtained as
(12)
where *FC*_{i} contains all frequent clusters whose support is *i*. Then, all frequent clusters that appear more than *Sup*_{min} times can be obtained using the following algorithm (Box 3):

### Box 3. Traffic frequent pattern mining algorithm.

Algorithm 3 TrafficPatternMining: Traffic frequent pattern mining

**Input:** *L = {C*_{1}, *C*_{2}, *…*, *C*_{D}*}*; *Sup*_{min}: minimum support

**Output:** *FC = {FC*_{i}*|i = Sup*_{min}, *Sup*_{min} *+1*, *…*, *D}*;

1: *FC ← ∅*;

2: **for** *i ←Sup*_{min} to *D*

3: *FC*_{i} *← ∅*;

4: *L*_{com} *← i-combinations from L*;

5: *FC*_{i} *←* Get all frequent clusters from (12);

6: *FC ← FC ∪ {FC*_{i}*}*;

7: **end for**

8: return *FC*;

#### Multi-clustering strategy.

As discussed in Section 3.2.1, the smaller the value of the constraint *ω*, the more similar are the traffic characteristics of the road segments in the same cluster; this may improve the estimation accuracy. However, the frequent cluster set covers fewer road segments because of the more stringent constraint. To resolve this conflict, a multiple-clustering strategy is adopted. Multiple constraints *ω*_{1}, *ω*_{2},*…ω*_{m} are selected for clustering and pattern mining; then, a list of frequent cluster sets, *FCL = {FC*_{ω} *|ω = ω*_{1}, *ω*_{2}, *…*, *ω*_{m}*}*, is generated, where *FC*_{ω} is the frequent cluster set corresponding to the constraint *ω*. In the process of finding the cluster that contains the road segment *r*, the frequent cluster set with smaller *ω* should be considered first.

### Probability calculation

#### Emission probability.

For a road segment *r* that belongs to the frequent cluster *C*, its traffic condition *x*^{r,t} at time slot *t* approximates the traffic conditions *y*^{r,t} of other road segments in *C*. Thus, the difference *diff* between *x*^{r,t} and *y*^{r,t} can be adopted to calculate the emission probability *Pr(y*^{r,t}*|x*^{r,t}*)*. According to observations, the emission probability follows an exponential distribution; hence, it can be calculated as
(13)
where, *λ* is the parameter of the exponential distribution (*λ>0*) and*diff* is the average traffic condition difference between the road segment *r* and road segments in *C' = {r'∈C| r'≠r}*.

In order to find the appropriate frequent cluster *C*, the frequent cluster set *FC*_{ω} with a smaller constraint value *ω* should be considered preferentially, and in the frequent cluster set *FC = {FC*_{i} *|i = Sup*_{min}, *Sup*_{min} *+1*, *…*, *D}*, the clusters that appear more frequently should be considered first.

#### Transition probability.

Through data analysis, in a relatively short time interval, the traffic condition at time slot *t+1* is close to that at time slot *t*. Thus, the traffic state change *∆x = |x*^{r,t}*—x*^{r,t+1}*|* is employed to measure the state transition. According to observations, the state transition probability follows an exponential distribution, and it can be expressed as
(14)
where, *β* is the parameter of the exponential distribution (*β>0*).

### Candidate selection

As discussed in Sections 3.2 and 3.3, the state *x*^{r,t+1} may approximate the previous state *x*^{r,t} and the observations *y*^{r,t+1} *= {x*^{i,t+1}*|i∈C and i≠r}*, where *C* is the frequent cluster that contains the road segment *r*. Then, the value range of *x*^{r,t+1} is set as *[x*_{min}*-μ*, *x*_{max}*+μ]*, where *x*_{min} = *min({x*^{r,t}*}*∪*y*^{r,t+1}*)*, *x*_{max} = *max({x*^{r,t}*}*∪*y*^{r,t+1}*)*, and *μ* is used to avoid missing valid values. For computational convenience, the range should be discretized to finite candidates denoted by the set
(15)
where *N*_{cand} is the number of candidates. In order to facilitate algorithm design, the candidate set is *CS* = *{x*^{r,t+1}*}* when the state at time slot *t+1* is obtained from the samples. Then, it is not necessary to find the missing state sub-sequences discussed in Section 2; the entire state sequence can be estimated in a single process.

### Real-time algorithm

For the road segment *r* at time slot *t*, a list *PreSList = {Se*_{i} *|i = 1*,*2*,*…m}* is used to store previous surviving state sequences. A sequence is denoted by *Se = (SS*, *JP)*, where *SS = {x*^{r,1},*x*^{r,2},*…*,*x*^{r,t-1}*}* stores previous consecutive candidate states and *JP* is the joint probability. Using Algorithm 4 (Box 4), the current candidate sequence list *SList* is obtained according to *PreSList* and the candidate states *CS*_{t} of road segment *r* at time slot *t*. Then, the state sequence with the maximum joint probability in *SList* is the optimal solution of road segment *r* at time slot *t*; the sequence is given by *argmax*_{Se∈SList}*{Se*.*JP}*. For the first state, the initial joint probability of the state sequence is the emission probability of the candidate states. Obviously, the algorithm can output the estimated states in real time; thus, it is applicable to online application.

### Box 4. Real-time traffic estimation algorithm based on HMM.

Algorithm 4 TrafficEstimation: Real-time traffic estimation based on HMM

**Input:** *PreSList*: List of surviving sequences of road segment *r* at time slot *t-1*; *t*: time.

**Output:** *SList*: List of surviving sequences of road segment *r* at time slot *t*.

1: *SList ← PreSList*; //*SList* is a current surviving sequence list

2: *CS*_{t} *←* Get candidate states; //Discussed in Section 3.4;

3: *SListTemp ← ∅*;

4: **if** *t = = 1*

5: **for** *x* in *CS*_{t}

6: Construct a new state sequence *Se*; Set *x* as the starting state;

7: *Se*.*JP ← Pr(y*^{r,t}*|x)*; //Discussed in Section 3.2;

8: Add *Se* into *SList*;

9: **end for**

10: **else**

11: **for** *x* in *CS*_{t}

12: *Se* ← *argmax*_{Se∈SList}{*Se*.*JP*∙*Pr*(*x*^{r,t−1},*x*)};

13: Set *x* as the *t*-th state of *Se*;

14: *Se*.*JP* ← *Pr*(*y*^{r,t}|*x*)∙*Se*.*JP*∙*Pr*(*x*^{r,t−1},*x*);

15: Add *Se* to the temporary list *SListTemp*;

16: **end for**

17: *SList* ← *SListTemp*;

18: **end if**

19: output *argmax*_{Se∈SList}{*Se*.*JP*}; //Real-time output current solution;

20: return *SList*;

## Results and Discussion

For the experiments, 8559 arterial road segments were selected; the roads cover the main regions of central Beijing. The traffic conditions between 6:00 and 24:00 were considered, and the time was divided into 108 time slots at 10-min intervals (e.g., the first time slot was 6:00–6:10 and the 12^{th} time slot was 7:50–8:00).

The taxi trajectory data in Beijing during November 2012 served as the FCD data, obtained from 12,600 taxis. The data samples of six weekdays were selected for a case study; five of these days were used for frequent cluster mining and parameter estimation, and the remaining day was used to test the estimation model. Before the experiments were performed, the trajectory data were matched to the road network using map-matching methods [34–36], and anomalous samples were eliminated.

The model was implemented using a Java platform on a computer having a quad-core CPU (2.2 GHz) and 8-GB memory.

### Frequent cluster mining

Six constraint values *{*ω |ω = *10*, *15*, *20*, *25*, *30*, *35}* were considered in the clustering analysis stage. The average weight *w*_{av} discussed in Section 3.2.1 was employed to measure the degree of similarity, which decreased as *w*_{av} increased. As shown in Table 1, the mean *w*_{av} of the cluster set and the average number of objects in each cluster, *ON*_{average}, increase with ω. Clusters having a single object cannot be used for estimation; the proportion of such clusters, *r*_{single}, decreases as ω increases. For traffic estimation, a perfect cluster set has small average *w*_{av}, large *ON*_{average}, and small *r*_{single}. A cluster set having small average *w*_{av} is more likely to have small *ON*_{average} and large *r*_{single}, which confirms the existence of the contradiction discussed in Section 3.2.3. Therefore, it is necessary to adopt a multi-clustering strategy.

As shown in Fig 5, the traffic characteristics of the road segments in the same cluster have a very high degree of similarity when the average weight *w*_{av} is small, such as clusters a, b, and c. As the average weight *w*_{av} increases, the degree of similarity of the cluster decreases and the number of the objects in the cluster increases.

(a) The average weight of cluster a is 4.71, (b) the average weight of cluster b is 9.75, (c) the average weight of cluster c is 14.58, and (d) the average weight of cluster d is 24.93.

Samples of five days were selected for frequent cluster mining, and the minimum support *Sup*_{min} was set as 3. Table 2 lists the coverage rates of the frequent clusters, which is given by the ratio *R*_{cover} *= N*_{cover}*/N*_{total}, where *N*_{cover} is the number of road segments in the frequent cluster set and *N*_{total} is the total number of road segments. The coverage rate increases with ω, and the support of the most frequent clusters is less than or equal to 4. When ω increases to 35, the coverage rate of the frequent cluster set reaches 96.76%, which indicates that the set of these ω is sufficient and appropriate for this study.

The road segments that are adjacent to each other may have similar traffic characteristics; this property can be used instead of clustering for finding similar road segments. However, in contrast to our assumption, this is not very likely in practice. As shown in Fig 6, although the proportion of road segments whose adjacent segments are in the same cluster increases with ω, this proportion is still low. Therefore, it is more reasonable to find similar road segments by clustering rather than by adjacency relationships.

### Parameter estimation

Statistical analysis of the distribution of *diff* was carried out in order to estimate the parameter *λ* in (13). In different frequent cluster sets corresponding to specific values of *ω*, the distribution of *diff* is different. In order to observe the distribution of *diff*, we calculated the ratio of each *diff* value to the total number of samples. As shown in Fig 7, the steepness of the distribution curve increases with *ω*, which indicates that the road segments in the frequent cluster generated on the basis of a smaller *ω* are more likely to have a higher degree of similarity, because the probability that *diff* takes a smaller value is higher.

The parameter *λ* was calculated as *1/E(diff)*, where *E(diff)* is the expectation of *diff*, and the equation was initialized with an initial parameter *λ*^{*}; then, the parameter was learned by iterative computation until it converged to a specific value. Table 3 lists *λ* values for six frequent cluster sets, *FCL = {FC*_{ω} *|ω = 10*, *15*, *20*, *25*, *30*, *35}*, as well as the sum of squared errors (SSE), root-mean-square error (RMSE), and R-square, which indicate that the equation works well for the samples.

As shown in Fig 8, the traffic state change *∆x* follows an exponential distribution. The parameter *β* is calculated as *1/E(∆x)*, where *E(∆x)* is the expectation of *∆x*; after iterative computation, the estimated values of *β*, SSE, RMSE, and R-square are 0.09451, 0.000528, 0.002436, and 0.9871, respectively.

### Model accuracy and efficiency

A state sequence set of 8559 arterial road segments was prepared for testing. Because it is difficult to obtain the complete state set of these road segments, the states that were actually obtained were considered for accuracy analysis; the total number of these states, *N*_{base}, was 6.19×10^{5}. Among these states, a number of states were randomly selected as the missing states that need to be estimated. The missing state rate is denoted by *R*_{miss} *= N*_{miss}*/N*_{base}, where *N*_{miss} is the number of missing states. The mean absolute error (MAE) was employed to measure the estimation accuracy, and it is given by
(16)
where *N*_{estim} is the number of states estimated, is the estimated value of the *i*-th state, and *x*_{i} is the true value of the *i*-th state.

The accuracies of two models, Model 1 and Model 2, were compared. Model 1 is the proposed model, which finds road segments with similar traffic states via clustering and frequent pattern mining, whereas Model 2 assumes that adjacent road segments have similar traffic conditions. As shown in Fig 9, MAE increases with *R*_{miss}, and the MAE of Model 2 is significantly higher than that of Model 1, which implies that the proposed model is more accurate.

If there is no reference state, such as the previous state or the states of similar road segments, for the state *x*^{r,t}, the state cannot be estimated. The rate of the states that cannot be estimated is *R*_{miss}*-R*_{estim}; then, the number of states that can be estimated is *R*_{valid} *= 1*-(*R*_{miss}*-R*_{estim}), where *R*_{estim} = *N*_{estim}*/N*_{base}. As shown in Table 4, MAE increases gradually before *R*_{miss} reaches 83.33%, and *R*_{valid} remains high until *R*_{miss} reaches 92.59%, which indicates that the model is applicable to very sparse sample data.

The cumulative distribution function (CDF) of the estimation error *E* is given by
(17)
where estimation error *E* is the absolute value of the difference between the estimated and observed values, *N*_{estim} is the number of estimated states, and *N(E ≤ e)* is the number of estimated states whose error is less than or equal to *e*. Fig 10 shows the CDFs of estimation errors corresponding to different values of the missing state rate *R*_{miss}. Before *R*_{miss} reaches 83.33%, the CDF curve is steeper, which indicates that most errors are small. For example, when *R*_{miss} = 83.33%, more than 52.79% of the errors are less than or equal to 5 and more than 76.88% of the errors are less than or equal to 10.

The states that are obtained are considered as the true states; then, the estimated error distribution function in the global scope is given by (18)

Table 5 summarizes the global distribution of the estimated error, which reflects the accuracy of the model corresponding to different values of the missing state rate *R*_{miss}. According to the error distribution, it is easy to determine whether the accuracy of the model meets the requirements of the application. For example, in an application that requires 90% of the errors are to be less than 5 km/h, if the missing state rate is less than 18.52%, then the model may be suitable for the application.

In our data source, the missing state rate of the arterial roads was around 33%, while the missing state rate of the other roads was around 65%. Samples of 8559 arterial roads were considered. The error was less than or equal to 5 (resp. 10) for more than 84.84% (resp. 92.61%) of the states; this indicates a high estimation accuracy. Fig 11(A) shows the traffic state map of the arterial roads in Beijing at the 50^{th} time slot (14:10–14:20) before estimation, and Fig 11(B) shows the states of the roads after estimation. Most of the missing states were estimated, and the estimated values were very close to the true values.

(a) Traffic condition map before estimation, and (b) traffic condition map after estimation; the black regions represent missing states.

The main factor that affects the efficiency of the model is the number of candidates for the hidden states, *N*_{cand}, which has been discussed in Section 3.4. Several values of *N*_{cand} were selected for the experiments, where the missing state rate was around 74%. The results show that the accuracy improved as *N*_{cand} increased; however, the time cost increased significantly (Table 6). When *N*_{cand} reached around 12, the accuracy stabilized and the model could estimate approximately 4169.22 states per second. From the viewpoint of practical application, the model can meet the efficiency requirements of metropolitan real-time traffic estimation.

## Conclusion

This paper presented an effective and efficient HMM-based model for urban-scale traffic estimation using floating car data. Clustering analysis and pattern mining were adopted to analyze a large data set of real probe data collected from a fleet of 12,600 taxis in Beijing, China, and it was found that there exist frequent clusters in which the road segments have similar traffic characteristics. Comparative analysis showed that the model based on clustering is more effective than the model based on adjacency relationships for traffic estimation. In order to achieve a trade-off between clustering accuracy and coverage, a multi-clustering strategy was adopted in the estimation process. Experimental results showed that the model can be applied to different scenarios; even when more than 70% of the original data are missing, the model can guarantee that more than 80% of the states have relatively small errors. In addition, the model was implemented using a real-time algorithm, which offers higher precision and has a broader scope for application than some offline traffic estimation algorithms.

## Author Contributions

Conceived and designed the experiments: XW LP TC. Performed the experiments: XW ML XY JS. Analyzed the data: XW ML XY JS. Contributed reagents/materials/analysis tools: XW ML XY JS. Wrote the paper: XW ML XY JS.

## References

- 1. Kong QJ, Zhao QK, Wei C, Liu YC. Efficient Traffic State Estimation for Large-Scale Urban Road Networks. Ieee T Intell Transp. 2013 Mar;14(1):398–407. doi: 10.1109/tits.2012.2218237
- 2. Leontiadis I, Marfia G, Mack D, Pau G, Mascolo C, Gerla M. On the Effectiveness of an Opportunistic Traffic Management System for Vehicular Networks. Intelligent Transportation Systems, IEEE Transactions on. 2011;12(4):1537–48. doi: 10.1109/tits.2011.2161469
- 3. Yeon J, Elefteriadou L, Lawphongpanich S. Travel time estimation on a freeway using Discrete Time Markov Chains. Transportation Research Part B: Methodological. 2008;42(4):325–38. doi: 10.1016/j.trb.2007.08.005
- 4.
Bramberger M, Brunner J, Rinner B, Schwabach H. Real-time video analysis on an embedded smart camera for traffic surveillance. Real-Time and Embedded Technology and Applications Symposium, 2004 Proceedings RTAS 2004 10th IEEE; 2004 May 25–28; 2004. p. 174–181.
- 5.
Herrera JC, Bayen AM. Traffic Flow Reconstruction Using Mobile Sensors and Loop Detector Data. TRB 87th Annual Meeting Compendium; 2007.
- 6. Hiribarren G, Herrera JC. Real time traffic states estimation on arterials based on trajectory data. Transport Res B-Meth. 2014 Nov;69:19–30. doi: 10.1016/j.trb.2014.07.003
- 7. Herrera JC, Work DB, Herring R, Ban X, Jacobson Q, Bayen AM. Evaluation of traffic data obtained via GPS-enabled mobile phones: The Mobile Century field experiment. Transportation Research Part C: Emerging Technologies. 2010;18(4):568–83. doi: 10.1016/j.trc.2009.10.006
- 8.
de Fabritiis C, Ragona R, Valenti G. Traffic Estimation And Prediction Based On Real Time Floating Car Data. Intelligent Transportation Systems, 2008 ITSC 2008 11th International IEEE Conference on; 2008 Oct 12–15; 2008. p. 197–203.
- 9. Bar-Gera H. Evaluation of a cellular phone-based system for measurements of traffic speeds and travel times: A case study from Israel. Transportation Research Part C: Emerging Technologies. 2007;15(6):380–91. doi: 10.1016/j.trc.2007.06.003
- 10. Breitenberger S, Grueber B, Neuherz M, Kates R. Traffic information potential and necessary penetration rates. Traffic Engineering & Control. 2004;45(11):396–401.
- 11. Vandenberghe W, Vanhauwaert E, Verbrugge S, Moerman I, Demeester P. Feasibility of expanding traffic monitoring systems with floating car data technology. Iet Intell Transp Sy. 2012 Dec;6(4):347–54. doi: 10.1049/iet-its.2011.0221
- 12.
Herring R, Hofleitner A, Abbeel P, Bayen A. Estimating arterial traffic conditions using sparse probe data. Intelligent Transportation Systems (ITSC), 2010 13th International IEEE Conference on; 2010 Sept 19–22; 2010. p. 929–36.
- 13. Zhu YM, Li Z, Zhu HZ, Li ML, Zhang Q. A Compressive Sensing Approach to Urban Traffic Estimation with Probe Vehicles. Ieee T Mobile Comput. 2013 Nov;12(11):2289–302. doi: 10.1109/tmc.2012.205
- 14.
Bejan AI, Gibbens RJ. Evaluation of velocity fields via sparse bus probe data in urban areas. Intelligent Transportation Systems (ITSC), 2011 14th International IEEE Conference on; 2011 Oct 5–7; 2011. p. 746–53.
- 15. Seo T, Kusakabe T, Asakura Y. Estimation of flow and density using probe vehicles with spacing measurement equipment. Transport Res C-Emer. 2015 Apr;53:134–50. doi: 10.1016/j.trc.2015.01.033
- 16. Fowe AJ, Chan YP. A microstate spatial-inference model for network-traffic estimation. Transport Res C-Emer. 2013 Nov;36:245–60. doi: 10.1016/j.trc.2013.08.011
- 17. Ramezani M, Geroliminis N. Queue Profile Estimation in Congested Urban Networks with Probe Data. Comput-Aided Civ Inf. 2015 Jun;30(6):414–32. doi: 10.1111/mice.12095
- 18. Tang TQ, Shi WF, Shang HY, Wang YP. An extended car-following model with consideration of the reliability of inter-vehicle communication. Measurement. 2014; 58:286–293. doi: 10.1016/j.measurement.2014.08.051
- 19. Gupta A.K.; Sharma S.; Redhu P. Analyses of lattice traffic flow model on a gradient highway. Commun Theor Phys. 2014; 62:393–404. doi: 10.1088/0253-6102/62/3/17
- 20. Gupta A.K.; Redhu P. Jamming transition of a two-dimensional traffic dynamics with consideration of optimal current difference. Phys Lett A. 2013; 377:2027–2033. doi: 10.1016/j.physleta.2013.06.009
- 21. Gupta A.K. A section approach to a traffic flow model on networks. Int J Mod Phys C. 2013,24. doi: 10.1142/s0129183113500186
- 22. Tang TQ, Li CY, Huang HJ. A new car-following model with the consideration of the driver’s forecast effect. Physics Letters A. 2010; 374(38):3951–3956. doi: 10.1016/j.physleta.2010.07.062
- 23. Yu SW, Shi ZK. Dynamics of connected cruise control systems considering velocity changes with memory feedback. Measurement. 2015; 64:34–48. doi: 10.1016/j.measurement.2014.12.036
- 24. Ge J, Orosz G. Dynamics of connected vehicle systems with delayed acceleration feedback. Transportation Research Part C. 2014; 46:46–64. doi: 10.1016/j.trc.2014.04.014
- 25. Tang TQ, Huang HJ, Shang HY. A dynamic model for the heterogeneous traffic flow consisting of car, bicycle and pedestrian. International Journal of Modern Physics C. 2010; 21:159–176. doi: 10.1142/s0129183110015038
- 26. Yu SW, Shi ZK. An extended car-following model considering vehicular gap fluctuation. Measurement. 2015; 70:137–147. doi: 10.1016/j.measurement.2015.03.031
- 27. Yu SW, Shi ZK. An extended car-following model at signalized intersections. Physica A, 2014; 407:152–159. doi: 10.1016/j.physa.2014.03.081
- 28. Yu SW, Shi ZK. An improved car-following model considering headway changes with memory. Physica A, 2015; 421:1–14. doi: 10.1016/j.physa.2014.11.008
- 29. Yu S, Liu Q, Li X. Full velocity difference and acceleration model for a car-following theory. Communications in Nonlinear Science & Numerical Simulation. 2013; 18(5):1229–1234. doi: 10.1016/j.cnsns.2012.09.014
- 30. Wang JW, Wang YS, Yun MP, Yang XG. Development of Urban Road Network Traffic State Dynamic Estimation Method. Math Probl Eng. 2015. doi: 10.1155/2015/714149
- 31. Work DB, Blandin S, Tossavainen O-P, Piccoli B, Bayen AM. A Traffic Model for Velocity Data Assimilation. Applied Mathematics Research eXpress. 2010 January 1; 2010(1):1–35. doi: 10.1093/amrx/abq002
- 32. Hofleitner A, Herring R, Abbeel P, Bayen A. Learning the Dynamics of Arterial Traffic From Probe Data Using a Dynamic Bayesian Network. Intelligent Transportation Systems, IEEE Transactions on. 2012;13(4):1679–93. doi: 10.1109/tits.2012.2200474
- 33. von Luxburg U. A tutorial on spectral clustering. Stat Comput. 2007;17(4):395–416. doi: 10.1007/s11222-007-9033-z
- 34. Chen BY, Yuan H, Li QQ, Lam WHK, Shaw SL, Yan K. Map-matching algorithm for large-scale low-frequency floating car data. Int J Geogr Inf Sci. 2014 Jan 2;28(1):22–38. doi: 10.1080/13658816.2013.816427
- 35. He ZC, She XW, Zhuang LJ, Nie PL. On-line map-matching framework for floating car data with low sampling rate in urban road networks. Iet Intell Transp Sy. 2013 Dec;7(4):404–14. doi: 10.1049/iet-its.2011.0226
- 36.
Raymond R, Morimura T, Osogami T, Hirosue N. Map matching with Hidden Markov Model on sampled road network. International Conference on Pattern Recognition (ICPR); 11–15 Nov. 2012; Tsukuba: IEEE; 2012. p. 2242–5.