A geographical location prediction method based on continuous time series Markov model

Trajectory data uploaded by mobile devices is growing quickly. It represents the movement of an individual or a device based on the longitude and latitude coordinates collected by GPS. The location based service has a broad application prospect in the real world. As the traditional location prediction models which are based on the discrete state sequence cannot predict the locations in real time, we propose a Continuous Time Series Markov Model (CTS-MM) to solve this problem. The method takes the Gaussian Mixed Model (GMM) to simulate the posterior probability of a location in the continuous time series. The probability calculation method and state transition model of the Hidden Markov Model (HMM) are improved to get the precise location prediction. The experimental results on GeoLife data show that CTS-MM performs better for location prediction in exact minute than traditional location prediction models.


Introduction
Location-based Service (LBS) is a kind of information service that provides users geographical positions located by mobile devices and wireless network. There exists wealth information in the location data, such as user's interests, user's hobbies and user's behavior pattern. LBS may be employed in a number of applications, including: location-based advertising [1], personalized weather services, entertainment [2], personal life and so on. An effective location prediction or recommendation can make the users have good experience.
The advances in location-acquisition and mobile communication technologies empower people to use location data with existing online social networks in a variety of ways. People can share their present location, record travel routes with GPS to share travel experiences in Geo-Life [3]. Zheng [4] gives the overview of trajectory data mining, including the trajectory data preprocessing, pattern mining and classification, it explores the connections, correlations and differences among these existing techniques and also some public trajectory datasets are presented. Zheng [5] puts forward the approach to find the top-k candidate trips within the uncertain trajectory data. The historical data is used to inference the travel trip and it reduces the uncertainty of the user's trajectory. PLOS  In order to improve the location service experience, it is needed to know the user's location in advance. For example, if it can be predicted that the user will appear location B at 6:00pm based on the previous locations visited, the LBS provider can send the recommendation information or advertisements for restaurants in location B to the user in advance. Xue [6] puts forward the SubSyn algorithm for location prediction. The user's historical trajectory data is decomposed into the set of sub trajectory, which increases the number of tracks and the training data size, and the prediction performance is improved. As for the track prediction, the method of Markov Model [7,8] is widely used and its central idea is to build the Markov chain for speculation. The algorithm of SMLP (Social-aware Mobile user Location Prediction algorithm) is put forward by Yu [9]. It integrates the user's correlation with Markov Model for location prediction. Although the algorithm requires less space than the Markov model, the prediction results are heavily affected by the region division. Lian [10] proposes the CEPR(Collaborative Exploration and Periodically Returning model) algorithm which adopts collaborative filtering technique and the user historical behavior is used for location prediction and recommendation. They also give the correlation analysis [11] between the user statistical information and location predictability on the data of Gowalla(https://snap.stanford.edu/data/locgowalla.html). The prefix tree and heuristic search strategy are adopted to implement the personalized trip recommendation by Zhang [12]. Wu [13] uses Markov Random Field to predict the annotation of location records and the user's destiny, the better performance is achieved when there exists more user's records. Nghia [14] uses matrix factorization to select features and predicts the user's location. Although the algorithm can predict geographic location on real time, their dataset is composed of tweets containing lots of semantic information. The results are influenced by people's subjective emotions and expressions.
Gambs [15] extends a mobility model called Mobility Markov Chain (MMC) to incorporate the n previous visited locations for next location prediction. However, it cannot predict location within any time interval. Mathew [16]presents a hybrid method for predicting human mobility on the basis of Hidden Markov Models (HMMs). They use forward algorithm to compute the probability of possible sequences and return the next place from the sequence with the highest probability. But the experimental results on GeoLife is not satisfied with the highest Precision@5 of 26.40. Qiao [17] proposes a hybrid Markov based prediction model that contains three stages: mobility pattern discovery, variable-order Markov predictor and mobility pattern based users similarity calculation. The human trajectory data is extracted from data traffic of an LTE(Long Term Evolution) network. The extensive experimental evaluations should be conducted to compare with other related work on different datasets. Huang [18] proposes a predictive model taking into account activity changes. It is implemented for two users selected from the GeoLife dataset and gets performance improvement. The study results are limited by the spatial and temporal coverage of the dataset used and it should be applied to predict human movement by different days of the week with better quality data.
In addition, many other approaches are also used to build the prediction model, such as the association rules based method [19] and so on. However, all of these existing strategies cannot give the prediction based on the real time.
GeoLife(https://www.microsoft.com/en-us/download/details.aspx?id=52367) is the commonly used data for location based service, which records a broad range of users' outdoor movements, including not only life routines but also some entertainments and sports activities. This trajectory dataset can be used in many research fields, such as mobility pattern mining, user activity recognition, location-based social networks and location recommendation.
In this paper, we address the issue of predicting the user's location on the continuous time series based on the historical trajectory data and give the improvement to the original Markov model. The discrete time sequence is simulated to the continuous sequence by Gaussian Mixture Model.

Location prediction structure
The potential location is discovered by filtering and clustering technique on large scale tracing point data. The structure of location prediction based on the real time series is shown in Fig 1. The user's trajectory data, which is represented by <Time, Location> series, is filtered and clustered to produce a series of candidate location. Gaussian mixture modeling is implemented on the serial density of each location. Combined with the transition probability matrix, the original markov model is improved to the new model which is based on the continuous time series. The symbols used in this paper are shown in Table 1.

Tracing data filtering
The noise data should be filtered to reduce the interference. The filtering algorithm retains the tracing points which maybe selected as the candidate location and discards the irrelevant tracing points. Three kinds of filtering strategies are shown in the following. The drifting phenomenon is caused by the strength of GPS signal and the satellite switching. The filtering rule is shown in the following: 2. Filtering the tracing point with a higher speed than the average walking speed. The pedestrian speed is relatively slow usually in the candidate location where the traffic is Table 1. The symbol definition.

Symbol Definition T
The continuous trajectory data series with a starting tracing point and an ending tracing point. T i j denotes the jth tracing point within the ith trajectory.

L
The location area with a large number of tracing points by clustering, such as business area and park. L i denotes the ith Location.

T
The set of user's sign-in time.

U
The set of time for the possible location transition.
@ Threshold for the walking time.
The kth trajectory from location L i to L j .

M
Probability matrix for location transition. greater than the regional carrying capacity. On the other hand, there exists the activity which can only be carried at lower speed, such as playing and watching.
The filtering rule is: Here, η value is set to 1.5m/s which is faster than the average walking speed.
3. Filtering the trajectory with shorter residence time.
The trajectory with shorter residence time has a little chance to be the part of the candidate location. The filtering rule: Here, N denotes the tracing point number in trajectory T i and Dt 0;N ðT i Þ represents the residence time for trajectory T i . The threshold @ value is set to 20 minutes which is also used by Huang [18] and Zheng [20] to extract meaningful human activities.
After filtering the noisy tracing point, we adopt the DBSCAN clustering algorithm to discover the candidate location. K-means and DBSCAN are the most commonly used algorithms for location clustering. However, K-means algorithm needs to set the number of clusters in advance, and it is not suitable for the experimental dataset GeoLife with the larger size and dispersion. We also try the Birch algorithm which cannot distinguish the tracing points by walking. DBSCAN is a density-based spatial clustering algorithm. It groups points together that are closely packed. The clusters with different shapes can be discovered and it is not needed to determine the cluster number in advance, and so DBSCAN algorithm is better to mine the potential locations.

GMM
The prediction model of a user's location L r in the time t is shown in formula 2.
Here, L r represents the location with maximum probability value at time t. It is difficult to compute P(L k |t) because of the continuity of time t, and we transform it to formula 3.
Here, P(t) denotes the sign-in probability for the time t and P(t|L k ) denotes the sign-in distribution for location L k within a day. P(L k ) denotes the sign-in probability of location L k . Therefore, the location prediction model is shown in formula 4.
The sign-in timet in the dataset is discrete, the GMM is used for modeling P(t) on the continuous time t by formula 5. The Gaussian Mixture Model is a combination of multiple Gaussian distributions. Here, α i denotes the weight of each Gaussian distribution in GMM.
Here, Nðt; m i ; s 2 i Þ is the Gaussian distribution in the time t and it is computed by formula 6.
It is similar to compute P(t|L k ) by the modeling with GMM shown in formula 7.
Here, t (k) represents the sign-in time for location L k . P(L k ) is computed by formula 8.
Here, |L k | denotes the tracing point number for location L k and ∑ i |L i | denotes the total tracing point number for all locations. Finally, the location prediction model is shown in formula 9.

GMM training algorithm
The sign-in frequency for different location L k is various and so the Gaussian distribution in formula 6 is different. It is needed to adjust dynamically. The algorithm is shown in Table 2.
Here, EM(T i ! ,d,e) represents that the GMM parameters are achieved by EM algorithm [21]. The Expectation-Maximization (EM) algorithm is an iterative method to find maximum likelihood of parameters in statistical models, where the model depends on unobserved latent variables. In our paper, EM algorithm is used to solve the parameters of Gaussian Mixture Model. If the Gaussian Mixture coefficient is lower than the threshold, the number of Gaussian distribution will be decreased and the Maximum error value will be increased for retraining. Here, we set the p_threshold to 0.01. Although GMM can predict the location by the use of time information, it can't take into account the context of the trajectory data which limits the prediction performance.

Location prediction by Markov Model based on the continuous time series(CTS-MM)
Markov model is a stochastic model which assumes that future states depend only on the current state, not on the events that occurred before it. The standard Markov Model cannot give the location prediction based on continuous time series. It only takes into account the context of the trajectory data without the time information. And furthermore, the discrete time series is needed to transform to the continuous time which is simulated by the Gaussian Mixture Distribution with EM algorithm. We put forward the Markov Model based on the continuous time series (CTS-MM), which considers not only the geographic feature in trajectory data but also the time feature.

CTS-MM
It is known that the user visits location L i in the time t and we want to predict the user's next location aftertime interval Δt. The model is shown in formula 10.
The CTS-MM is used to modeling the user's visiting sequences, which is shown in

Algorithm 2. GMM Training Algorithm for Different Location
Input:  // Gaussian Model number is more than 1 and min(b i ! ) is smaller than the threshold Here, I i;k l denotes the lth trajectory from location L i to L k and L a is the first transferred location after L i in I i;k l . The first item of formula 12 is the transition probability from location L i to L a . The second item is the conditional probability with formula 3. The third item is a recursion item which represents the transition probability from L a to L k . The variable (L k ; I a;k l jL a ) denotes the transferring status from L a to L k . The variable (Dt À U ðkÞ b þ U ðaÞ a ) denotes that the user will change the location from L a to L k aftertime interval (Dt À ðU ðkÞ b À U ðaÞ a Þ). U ðkÞ b represents the βth transition time for L k and U ðaÞ a represents the αth transition time for L a . It is needed to get the transition time for each location. U ðiÞ ¼ ft 1 ; t 2 ; . . . ; t n g represents thetransition time series for location L i and they are labeled as the node in Fig 3. The possible transition time is extracted from the training data. Firstly, the location for every sign-in time is recognized. And then select the marginal sign-in time when the location transition occurs. Finally, all of the marginal sign-in time is clustered and the center of each cluster t 1 , t 2 , . . ., t n is selected as the transition time series. In addition, a random bias ξ for t i is used to simulate the interval of status transition.

Location prediction algorithm
The location prediction algorithm based on the time series is shown in Table 3, which is implemented by the recursion strategy. For start location L i , predict the next location with the maximum probability after time interval Δt. The array P records the prediction probability for each location.
The probability distribution for each location is calculated recursively. The transition time for location L a is recorded in the vector U ðaÞ , and if there is more time to transfer to next location, the algorithm TDLP will be implemented recursively. Otherwise, the recursion will be stopped.
Here, ξ is a random value used to simulate the transition time interval. It means that user may change location after time interval ξ. We give the experiments to set different ξ value, and it denotes that the better performance is achieved with ξ = 5 minutes, which is shown in the experimental section.

DataSet
GeoLife, developed by Microsoft, is a location-based social-network project. It enables users to share life experience and build connections among each other by using location history. Furthermore, it contains 182 users' travel records and total of 17621 trajectories. Most of the tracing points in the dataset are located in Beijing and our experiment also gives the prediction and analysis for location here. The statistical data distribution after filtering and clustering is shown in Fig 4. The parameter of maximum density for DBSCAN clustering algorithm is set to 0.0005 and the minimum size of a cluster is set to 20.

Algorithm 3. TDLP(Time-Dependent Location Prediction Algorithm)
Global Variable:P ¼ ðPðL 0 L i ; t; DtÞ; PðL 1 L i ; t; DtÞ; . . . ; PðL m L i ; t; DtÞÞ P L a ¼ PðL a L i ; t; DtÞ Input:  Most locations, around 87%, have only a little tracing points(less than 1000). For total 182 users, average sign-in frequency in these locations is less than 6.
The average residence time distribution for different locations is shown in Fig 5. It is also found that average residence time is less than 5 minutes for more than 80% locations.
It can be concluded that the distribution of the tracing points is dispersive and user's moving frequency is relatively fast, which will play an important role in experiment performance.

Experimental result
Among the total 182 users, there are 22 users who have trajectory less than 4 and the distribution is shown in Fig 6. They are filtered from the dataset. For the remaining 160 users, we select their trajectory data in adjacent 20 days as training set and the data of next 5 days is used as test set. For users who have trajectory data less than 25 days, we choose 80% of it for training and the rest for testing. The evaluation metric is precision and itis computed by Precision = A/ B. Here, A denotes the number of correct samples by prediction and B denotes the total number of samples by prediction.
Clustering result for location discovery. The trajectory distribution has a big difference between weekdays and weekends. We divide the data into two groups for clustering respectively. The sample of clustering results on weekdays and weekends data are shown in Fig 7(a)-7(d).
As it can be seen from Fig 7, there are more tracing points on weekdays than weekends because more activities happened within five weekdays. The DBSCAN algorithm is effective for location clustering, and the boundary of different clusters are clearly in line with reality. Here, the clusters with different color denote different discovered locations.
Prediction performance impact by different parameter

• Impact by different time interval Δt
The evaluation result of location prediction after different time interval Δt is shown in Fig 8. The prediction precision of GMM is shown in Fig 8(a). For different time interval Δt, the performance of CTS-MM are shown in Fig 8(b)-8(f) separately. It is noticed that the start time of Fig 8(f) is set to 1 o'clock because sign-in behavior across different day is not considered.
We find that the performance varies with time t on these two models: GMM and CTS-MM. The precision is higher in early of the morning and lower in other time on GMM, which is shown in Fig 8(a). On the other hand, the prediction performance of CTS-MM shown in  • Impact by different ξ value The parameter ξ mentioned in Table 3 is used to simulate the transition time interval between different locations. We give the experimental results with different ξ value in Table 4.
With different prediction time interval Δt, the prediction performance is better when ξ is set to 5 minutes. And the precision gets 0.451 when time interval Δt is set between 10 minutes to 30 minutes.
Prediction performance by different user. We give experiment on different users and the comparison results on both GMM and CTS-MM with different time intervals Δt are shown in Fig 9(a)-9(f).
Compared with the average precision, each user's prediction performance deviated largely (no more than ±15%). The performance of CTS-MM is improved about 12% with the comparison of GMM when time interval Δt is shorter than one hour. It declines significantly after one hour, which is shown in Fig 9(f). The user's travel uncertainty increases with longer duration time.
Discussion. The prediction performance is evaluated on GeoLife dataset and the average precision is about 43% by CTS-MM. Compared with other related algorithms shown in Table 5, fewer features are used and the precision is higher by CTS-MM.   Geographical location prediction based on Markov model Our proposed model CTS-MM and GMM can give prediction on continuous time and furthermore CTS-MM performs better than GMM improved by approximately 12% on precision. The trajectory information feature is used for training by these two models. A common problem with related work is the inability to make location predictions on real continuous time. In addition, Song [24]also gives the location prediction by part of the better trajectory with ID number 0 and 17 on GeoLife and the precision varies from 20% to 80%. But the result on all of the trajectory data is not given.
Location prediction on continuous real-time data is a challenging task and our CTS-MM model makes it possible to predict for any time parameter t. The real time information is used for training and the model is applied on real time series for prediction, which makes result more reasonable. Our work is implemented on the total dataset of GeoLife and the average precision on different time interval varies from 10% to 90%. Within the time interval of 1 hour, the average prediction precision of all the users varies from 20% to 60%.
We have filtered 22 users who have fewer trajectory from the GeoLife data, and the robustness of our model is not checked on the sparse trajectory data. It may make a difference in prediction performance and the further analysis will be our next step. An effective location prediction can bring a better user experience, such as advertising directed at customers based on their current location.

Conclusion
With the increasing of user's location data, the application of location based service becomes more and more popular and it is important to provide superior service for the user. The most common method for location prediction is Markov Model. But it only considers the user's location movement sequence and it is impossible to give a prediction related to the time information. We put forward a new method by Markov Model based on the Continuous Time Series(CTS-MM). The Bayes model and GMM are used for modeling the posterior probability of the location with continuous time series. The discrete status sequence of the HMM is changed to continuous sequence, which enables the model to predict the location in different real time. The experimental results on GeoLife data denote that the proposed model achieves a higher precision than traditional methods. The distribution of the tracing points on GeoLife is dispersive and the user's moving frequency is relatively fast, which makes the prediction task more challenging. Specially, with the increasing of time interval Δt, the precision will decline because it brings much travel uncertainty.
In future work, other models will be considered to improve prediction performance. At the same time, some other effective information, such as sign-in data and user data, will be used for assistance.