Human mobility prediction from region functions with taxi trajectories

People in cities nowadays suffer from increasingly severe traffic jams owing to limited awareness of how collective human mobility is affected by urban planning. Besides, understanding how region functions shape human mobility is critical for business planning but remains unsolved so far. This study aims to discover the association between region functions and the resulting human mobility. We establish a linear regression model to predict the traffic flows of Beijing from an input referred to as a bag of POIs (points of interest). By solving the predictor in the sense of sparse representation, we find that the average prediction precision is over 74% and that each type of POI contributes differently to the predictor, which indicates which region functions attract visitors and to what degree. Based on these findings, predicted human mobility could be taken into account when planning new regions and region functions.


Introduction
People in cities suffer from increasingly severe traffic jams. Tremendous efforts have been made to study the mechanism by which traffic jams are generated. One dominant view is that traffic flows can be regarded as mutually affected components in a complex network; as a result, traffic jams spread to affect traffic flows in the whole city through local interactions [1]. Work on human mobility pattern mining and traffic flow prediction has produced the following analytical methodologies: Statistical Methods [2] [3], Nonnegative Matrix Factorization and Optimization Methods [4], Entropy-maximizing Methods [5], Multiscale Radial Basis Function (MSRBF) networks [6], and Deep Learning Methods [7]. However, such studies are limited in that they do not take into account the social contexts in which traffic flows are generated, that is, the motivations that drive people to travel from one place to another. In fact, there are regular patterns in people's travel behavior and many motivations that may cause people to visit a place at a particular time. For example, most people travel to offices at morning rush hour and go shopping in their leisure time. It is experimentally observed that the daily travel of an individual has a high probability of falling within deterministic paths [8]. Therefore, awareness of the causes of collective human mobility is the key to alleviating traffic jams at the urban planning stage, which is preferable to regulations that attempt to fix deterministic traffic flows. In addition, the number of visitors to a certain region, if predictable, is of interest to people who plan to start new businesses. This study aims to answer two questions that have never been explored: (1) Do certain region functions lead to predictable human mobility, and is it possible to predict such human mobility from region functions? (2) Is there any deterministic relationship between each type of region function and the attraction of people? If so, how does one quantitatively evaluate the impact factor of every cause of human mobility (e.g., business, residence, entertainment, and shopping)? In this study, we formulate the problem as prediction based on sparse representation and optimize the predictor in a manner similar to variable selection. We conduct the experiments with the GPS traces of 12,000 taxis in Beijing over one month. The results are encouraging in that over 74% prediction accuracy is obtained when taking the POI (point of interest) features of 42×45 blocks as input. Moreover, the dominant factors contributing to collective human mobility at specified times are identified and quantified, and they coincide with prior knowledge.

Related work
In early studies, mobile phone data are used for inferring social networks [9] and land survey [10], which advance the studies on social and natural phenomena through human mobility.
Recently, region function discovery and land use identification from human mobility have attracted much attention. In [11], blocks in Beijing are grouped into a couple of regions with explicit functions consistent with human knowledge, where human mobility features extracted from taxi trajectories are combined with POI features to infer the attributes of a given block in the framework of Topic Model. Then, by clustering the blocks in accordance with the similarity between the attributes over latent topics, a couple of regions dominated by a certain function are obtained, for instance, diplomatic and embassy, education and science, and commercial/entertainment areas. In [12], the land uses of Manhattan are categorized into 4 classes, say, commercial, residential, industry, and recreation areas, by clustering tweet activities with geo-tags in each land segment, where Self-Organizing Map (SOM) is applied to obtain the segmentation of the land and k-means clustering is employed to group the land segments with similar tweet activities. In [3], taxi traces in Shanghai are decomposed into the linear combination of 3 basic patterns by means of Non-negative Matrix Factorization, where the 3 basic patterns correspond with commuting between home and workplace, business traveling between two workplaces, and trips from or to other places. As a result, the land use of the origins and destinations of the traffic flows can be inferred from such basic mobility patterns. In [13], the categories of POI are inferred from the taxi stops in Milan based on Hidden Markov Model. In [14][15] [16], supervised methods are applied to classify regions into a couple of categories of different land uses with manually annotated labels a priori for training. 
Unlike the aforementioned works, which focus on region function annotation, our goal is to infer whether there are consistent mappings that correlate people's collective mobility with the bag-of-POI features and to determine how to predict the traffic flows generated by newly planned regions from the POI features. Awareness of such issues makes it possible to optimize traffic loads at the urban planning stage. In [17], the correlation between human mobility patterns and region functions is mined through co-training based multi-view clustering, but the association between the two modalities of the data holds in only about 50% of cases. Moreover, that study does not solve the following two problems: (1) how to discover the impact factor of each region function in terms of affecting human mobility, and (2) how to predict the corresponding traffic flows from the functions of a region. This study proposes to solve the two problems in the framework of sparse representation. In the author's previous work [18], all regions of the city of interest are divided into hot regions and non-hot regions in a coarse manner according to the population flow in different time periods. In this study, fine-grained region partitions are considered in contrast to [18]. Moreover, the impact of the city functions of neighboring regions on the human mobility of the target region is investigated in detail. The varying parameters applied to region partition and the analysis of the influence of neighboring regions lead to more extensive experiments than in [18].
The rest of the paper is organized as follows: In section 3, we present the details of predicting human mobility from region POIs. The experimental findings under different cases are presented in section 4. Discussion and concluding remarks are provided in section 5.

Predicting human mobility from POI
With the continuous development of urban construction, all kinds of living, working, and entertainment facilities have been added to a city's different regions, giving every region its own mix of functions. So far, the city functions enabled by such facilities can mainly be sorted into 20 categories: life services, corporations/enterprises, government agencies and social organizations, shopping services, scenic spots, and so on, as shown in Table 1. City functions with geo-tags are commonly known as POIs (points of interest), where each original POI record includes the name and the category of the place and the latitude and longitude of its position. In order to study the differences between regions in a city, a grid is applied to divide the urban area into a number of regions. Taking Beijing as an example, as illustrated in Fig 1 with OpenStreetMap, the city is partitioned into grid regions, and the functions of each region are in effect an assembly of the POIs listed in Table 1.

Bag-of-Words representation of city functions
For every region shown in Fig 1, we collect the POI records through free software, namely the Baidu API. Here, what concerns us is the city function denoted by every POI record; as shown in Table 1, there are 20 POI categories in total. Hereafter, we characterize the city functions of every region through the Bag of Words model, which is well known in the Natural Language Processing literature. That is, for region $r$, the corresponding Bag of POI (BOP) representation is a vector of the form

$$F^r = [p_1^r, p_2^r, \ldots, p_n^r]^T,$$

where $p_i^r$ represents the number of points belonging to the $i$th POI category in Table 1 and $n = 20$ is the total number of POI categories.
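As a minimal sketch of how such BOP vectors could be assembled, the following counts POI records per grid cell and category. The record layout `(category_index, latitude, longitude)`, the grid origin, and the cell size in degrees are hypothetical choices made for illustration; they are not taken from the study.

```python
N_CATEGORIES = 20  # the 20 POI categories of Table 1


def bag_of_pois(poi_records, lat_min, lon_min, cell_deg):
    """Map each grid cell (row, col) to its n-dimensional POI count vector.

    poi_records: iterable of (category_index, latitude, longitude), with
    category_index in 0..N_CATEGORIES-1 (a hypothetical encoding).
    """
    counts = {}
    for cat, lat, lon in poi_records:
        # Locate the grid cell that contains this POI.
        cell = (int((lat - lat_min) // cell_deg),
                int((lon - lon_min) // cell_deg))
        vec = counts.setdefault(cell, [0] * N_CATEGORIES)
        vec[cat] += 1  # one more POI of this category in this cell
    return counts
```

Each value in the returned dictionary plays the role of the vector of per-category counts for one region.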

Statistics of origin-destination flows
Origin-destination (OD) flows, which count the number of individual movements between locations in a city, reflect not only human activity but also urban dynamics, and they are widely used in city planning and traffic engineering [19]. In this paper, we estimate the regional OD flows during different time periods from all taxi traces in the city of interest. Here, $O_t^r$ and $D_t^r$ denote the outflow and inflow of region $r$ during time period $t$.
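The counting of regional outflows and inflows can be sketched as below. The trip tuple layout is hypothetical, and attributing the outflow to the pickup period and the inflow to the drop-off period is an assumption made for illustration.

```python
from collections import defaultdict


def od_flows(trips):
    """Count outflow[t][r] and inflow[t][r] per time period t and region r.

    trips: iterable of (origin_region, pickup_period, dest_region,
    dropoff_period); this layout is a hypothetical one.
    """
    outflow = defaultdict(lambda: defaultdict(int))
    inflow = defaultdict(lambda: defaultdict(int))
    for o_region, o_period, d_region, d_period in trips:
        outflow[o_period][o_region] += 1  # a trip leaves its origin region
        inflow[d_period][d_region] += 1   # and enters its destination region
    return outflow, inflow
```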

Identifying hot regions
From the perspective of city planning or transportation management, the regions with large outflows and inflows deserve much more attention. Here, such regions are referred to as hot regions. We apply two thresholds, $\mathrm{thr}_{D,t}$ and $\mathrm{thr}_{O,t}$, to the inflow and the outflow of every region to select the hot regions of a given time period, as defined in Eqs (1) and (2). The thresholds are chosen to meet the requirements of Eqs (3) and (4), in which $\alpha$ and $\beta$ are two parameters to be adjusted empirically and $N$ is the number of regions in the city of interest. Smaller values of $\alpha$ and $\beta$ correspond with a higher degree of hotness of the hot regions.
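The hot-region selection can be sketched as follows, under the assumption that the threshold is chosen so that roughly a given fraction (α or β) of the N regions qualifies. This reading, the rounding, and the handling of ties are assumptions for illustration, not the exact definitions of Eqs (1)-(4).

```python
def hot_regions(flows, fraction):
    """Return (hot_region_ids, threshold) keeping the top `fraction` of regions.

    flows: dict mapping region id -> flow count for one time period.
    """
    n_hot = max(1, round(fraction * len(flows)))
    ranked = sorted(flows, key=flows.get, reverse=True)
    threshold = flows[ranked[n_hot - 1]]  # smallest flow still counted as hot
    # Regions tied with the threshold are all kept, so the result may
    # contain slightly more than n_hot regions.
    hot = [r for r in ranked if flows[r] >= threshold]
    return hot, threshold
```

With six regions and a fraction of 1/3, the two regions with the largest flows are returned; lowering the fraction to 1/6 keeps only the single hottest region.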

Discovering correlation between region functions and human mobility
During time period $t$, we can get the population entering and leaving the hot regions via Eqs (1) and (2), and we formulate the outflows and inflows of all hot regions as follows:

$$O_t = [O_t^1, O_t^2, \ldots, O_t^m]^T, \qquad D_t = [D_t^1, D_t^2, \ldots, D_t^k]^T,$$

where $m$ and $k$ are the numbers of hot regions in terms of outflows and inflows identified via Eqs (3) and (4), respectively, and the superscripts index the hot regions of the respective kind. Meanwhile, the corresponding BOP representations of the city functions of such hot regions can be arranged into two matrices as follows:

$$X_{O,t} = [F^1, F^2, \ldots, F^m]^T, \qquad X_{D,t} = [F^1, F^2, \ldots, F^k]^T.$$

We suppose that there is a linear relation correlating the region functions with the outflows and inflows as follows:

$$O_t = X_{O,t} W_{O,t}, \tag{5}$$

$$D_t = X_{D,t} W_{D,t}, \tag{6}$$

where $W_{O,t}$ and $W_{D,t}$ are two $n$-dimensional vectors of weighting coefficients to be solved. Once $W_{O,t}$ and $W_{D,t}$ are obtained, we know how much each of the $n$ POI categories contributes to generating the corresponding portions of the outflows and inflows with regard to all the hot regions as a whole. Here, the sparse representation method (SRM) is used to solve Eqs (5) and (6), since it leads to as few nonzero weights as possible in $W_{O,t}$ and $W_{D,t}$ given a predefined precision of approximation. In detail, we use the Least Angle Regression (LARS) algorithm [20] to obtain the estimates of $W_{O,t}$ and $W_{D,t}$ as follows:

$$\hat{W}_{O,t} = \arg\min_{W_{O,t}} \|O_t - X_{O,t} W_{O,t}\|_2^2 + \lambda_1 \|W_{O,t}\|_1, \tag{7}$$

$$\hat{W}_{D,t} = \arg\min_{W_{D,t}} \|D_t - X_{D,t} W_{D,t}\|_2^2 + \lambda_2 \|W_{D,t}\|_1, \tag{8}$$

where $\lambda_1$ and $\lambda_2$ are the Lagrange multipliers that control the balance between the two terms, namely the sparseness of the solution and the precision of approximation. Owing to the penalties $\|W_{O,t}\|_1$ and $\|W_{D,t}\|_1$ imposed on the objectives in Eqs (7) and (8), we obtain as few nonzero weights as possible in $W_{O,t}$ and $W_{D,t}$, which correspond with the POI categories relevant to the prediction. Meanwhile, the numerical values of such nonzero weights reflect how big a role they play in the predictor, that is, how much each relevant POI category contributes to the prediction.
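To illustrate how sparse weights arise, the following is a minimal sketch that assumes the objective is L1-penalised least squares, $\|y - Xw\|_2^2 + \lambda \|w\|_1$. It uses cyclic coordinate descent rather than the LARS algorithm used in the study; this swap is made only to keep the sketch short and dependency-free, and the function names are hypothetical.

```python
def soft_threshold(value, bound):
    """Shrink value toward zero by bound; values inside [-bound, bound] become 0."""
    if value > bound:
        return value - bound
    if value < -bound:
        return value + bound
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimise ||y - Xw||_2^2 + lam * ||w||_1 by cyclic coordinate descent.

    X: list of m rows, each a list of n features; y: list of m targets.
    """
    m, n = len(X), len(X[0])
    w = [0.0] * n
    residual = list(y)  # equals y - Xw for the current w (initially w = 0)
    col_sq = [sum(X[i][j] ** 2 for i in range(m)) for j in range(n)]
    for _ in range(n_iter):
        for j in range(n):
            if col_sq[j] == 0.0:
                continue  # an all-zero column keeps a zero weight
            # Correlation of column j with the residual that ignores w[j].
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(m))
            new_wj = soft_threshold(rho, lam / 2.0) / col_sq[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(m):
                    residual[i] -= X[i][j] * delta
                w[j] = new_wj
    return w
```

Weights whose correlation with the residual stays below `lam / 2` are set exactly to zero, which is the variable-selection effect the text relies on.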

Preprocessing
Prior to mining the correlation between region functions and human mobility, some preprocessing steps are necessary. Intuitively, besides the functions that attract people to the destination region itself, the functions of the neighboring regions should also have some impact on people's travel behavior. Therefore, we absorb the city functions of the neighboring regions into the BOP representation of every region; that is, we compute the $i$th component $\bar{p}_i^r$ of the new BOP representation of region $r$ by Eq (9), where $R$ is the collection of the eight neighboring regions around region $r$ and $\kappa$ is the factor that weights the contribution of region $r$ against that of the 8 neighboring regions. A bigger $\kappa$ corresponds with a smaller contribution of the neighboring regions. Then, the term frequency-inverse document frequency (TF-IDF) method is applied to the new descriptor obtained after absorbing the city functions of the neighboring regions, namely, $\bar{F}^r = [\bar{p}_1^r, \bar{p}_2^r, \ldots, \bar{p}_n^r]^T$. Here, the TF term of the $i$th POI category is defined in Eq (10) and the IDF term in Eq (11), where $|r|$ is the number of hot regions in the time period of interest, $|r_i|$ is the number of hot regions containing the $i$th POI category, and $\delta = 0.000001$ is a small constant to avoid $\mathrm{IDF}_i^r = 0$. The TF-IDF value of the $i$th POI category in region $r$ is then computed by Eq (12). After the above preprocessing, we obtain the new region function description for every hot region $r$, which gives $X'_{O,t}$; similarly, we can obtain $X'_{D,t}$. Then, $X'_{O,t}$ and $X'_{D,t}$ are applied to Eqs (7) and (8) to obtain the solution.
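The two preprocessing steps can be sketched as below. The exact forms are assumptions: the neighbor absorption is taken as a convex combination of a region's own counts and the mean of its neighbors' counts, TF as a category's share of the region total, and IDF as a logarithmic ratio with δ added. The definitions in Eqs (9)-(12) may differ in detail.

```python
import math

DELTA = 1e-6  # the small constant delta mentioned in the text


def absorb_neighbours(own, neighbours, kappa):
    """Blend a region's BOP vector with the mean of its neighbours' vectors.

    Assumed form: kappa * own + (1 - kappa) * mean(neighbours).
    """
    n = len(own)
    mean = [sum(v[i] for v in neighbours) / len(neighbours) for i in range(n)]
    return [kappa * own[i] + (1.0 - kappa) * mean[i] for i in range(n)]


def tf_idf(bops):
    """bops: list of BOP vectors, one per hot region. Returns TF-IDF vectors."""
    n_regions, n = len(bops), len(bops[0])
    # Number of hot regions that contain each POI category.
    containing = [sum(1 for v in bops if v[i] > 0) for i in range(n)]
    idf = [math.log(n_regions / containing[i]) + DELTA if containing[i] else 0.0
           for i in range(n)]
    result = []
    for v in bops:
        total = sum(v)
        result.append([(v[i] / total if total else 0.0) * idf[i] for i in range(n)])
    return result
```

A category present in every hot region gets an IDF close to zero, so ubiquitous POI types are down-weighted relative to distinctive ones.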

Performance evaluation metric
In the training stage, after applying $X'_{O,t}$ and $X'_{D,t}$ to Eqs (7) and (8), we obtain the weights $\hat{W}_{O,t}$ and $\hat{W}_{D,t}$. In the testing stage, we apply $\hat{W}_{O,t}$, $\hat{W}_{D,t}$, $X'_{O,t}$, and $X'_{D,t}$ to Eqs (5) and (6) to get the predicted outflows and inflows. Here, the Mean Absolute Percentage Error (MAPE) is used to measure the prediction accuracy, as defined in Eqs (13) and (14), where $\hat{O}_t^r$ and $\hat{D}_t^r$ are the predicted outflow and inflow for region $r$, and $m$ and $k$ are the numbers of hot regions at time $t$ in terms of outflow and inflow, respectively.
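A minimal sketch of the metric follows. The MAPE itself is standard; reporting accuracy as 1 - MAPE is an assumption made here to match the percentage accuracies quoted in the text, and skipping regions with zero actual flow is a choice made only to avoid division by zero.

```python
def mape(actual, predicted):
    """Mean absolute percentage error over regions with nonzero actual flow."""
    terms = [abs(a - p) / a for a, p in zip(actual, predicted) if a != 0]
    return sum(terms) / len(terms)


def accuracy(actual, predicted):
    """Accuracy taken as 1 - MAPE (an assumption, not confirmed by the source)."""
    return 1.0 - mape(actual, predicted)
```

For example, predicting 90 and 220 for actual flows of 100 and 200 gives a MAPE of 0.1.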

Parameter setting
According to Eqs (7) and (8), the only parameters to be set are $\lambda_1$ and $\lambda_2$. By trying $\lambda_1, \lambda_2 \in [0.001, 0.3]$ with an increment of 0.002, we obtain the weights and the prediction accuracy in the sense of Eqs (13) and (14) for every parameter value, and we set each parameter to the value leading to the highest precision on the training data.
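The parameter sweep can be sketched as a simple grid search. The `score` callable is a hypothetical stand-in for "fit the predictor with this λ and return the training accuracy"; only the grid bounds and increment come from the text.

```python
def lambda_grid(start=0.001, stop=0.3, step=0.002):
    """Candidate values start, start + step, ... that do not exceed stop."""
    n_steps = int((stop - start) / step + 1e-9)
    # Rounding removes floating-point drift from repeated additions.
    return [round(start + i * step, 6) for i in range(n_steps + 1)]


def best_lambda(score, grid):
    """Return the grid value with the highest score(lam)."""
    return max(grid, key=score)
```

With the stated bounds, the grid runs from 0.001 to 0.299 and contains 150 candidate values.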

The data
The data of about 12,000 taxis with GPS trajectories traveling within the urban area of Beijing are used in the experiments. The data were mainly collected from 1 to 31 October 2012, and the OD flows are extracted from the origins and destinations of taxi journeys with passengers. We take every hour of each day as a time period to compute the outflows and inflows. The sampling frequency of each GPS trace is 1-2 times per minute. The data contain the following items: TAXI_ID, GPS_TIME, GPS_LONGITUDE, GPS_LATITUDE, and STATE, which indicates whether a taxi has passengers. A filtering process is adopted to remove incorrect records caused by hardware faults. In addition, records outside the area of interest are discarded.
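The extraction of passenger journeys from the raw records can be sketched as follows. The field names are those listed above; encoding STATE as 1 for occupied and 0 for vacant, and a sortable GPS_TIME, are assumptions made for illustration.

```python
from collections import defaultdict


def extract_trips(records):
    """Return (taxi_id, pickup_record, dropoff_record) for each occupied journey.

    records: iterable of dicts with keys TAXI_ID, GPS_TIME, GPS_LONGITUDE,
    GPS_LATITUDE, STATE (assumed 1 = carrying passengers, 0 = vacant).
    """
    by_taxi = defaultdict(list)
    for rec in records:
        by_taxi[rec["TAXI_ID"]].append(rec)
    trips = []
    for taxi_id, recs in by_taxi.items():
        recs.sort(key=lambda r: r["GPS_TIME"])
        pickup = None
        prev_state = None  # unknown before the first sample of a trace
        for rec in recs:
            if prev_state == 0 and rec["STATE"] == 1:
                pickup = rec  # vacant -> occupied: journey origin
            elif prev_state == 1 and rec["STATE"] == 0 and pickup is not None:
                trips.append((taxi_id, pickup, rec))  # occupied -> vacant: destination
                pickup = None
            prev_state = rec["STATE"]
    return trips
```

Journeys already in progress at the start of a trace, or still open at its end, are skipped because their origin or destination is not observed.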

Findings
In the experiments, as for training, we first select 20 days in a month at random to get the hot regions in the sense of Eqs (1)-(4) for all the time periods. Then, the data of the remaining 11 days are used to validate the predictor. After that, the data for work days and weekends are applied separately for performance evaluation. Finally, the traffic patterns in different week days are investigated.
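The random split of the month into 20 training days and 11 validation days can be sketched as below; the fixed seed is a hypothetical choice for reproducibility.

```python
import random


def split_days(days, n_train=20, seed=0):
    """Randomly pick n_train days for training; the rest are for validation."""
    days = list(days)
    rng = random.Random(seed)
    train = set(rng.sample(days, n_train))
    return sorted(train), sorted(d for d in days if d not in train)
```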
As an example, let the POI weighting factor for absorbing neighboring regions be κ = 7/8 and the hot region selection parameters be α = 1/3 and β = 1/3. Counting the taxi OD flows as a snapshot of human mobility, the weighting coefficients are solved from Eqs (7) and (8). Prediction accuracy is lower from 5:00 AM to 6:00 AM: since this is not the normal time for people to start their daily lives, travel is characterized by greater uncertainty and the OD data are sparse. After that period, the prediction accuracy is much better. From 8:00 AM, prediction accuracy gradually increases and reaches its highest point at 13:00. During working hours, travel behaviors tend to be random, causing prediction accuracy to degrade. When evening rush hour begins at 16:00, prediction accuracy starts to increase again and reaches another peak at the end of evening rush hour (19:00). Then, high precision is maintained until 24:00. The overall average prediction accuracy for an entire day is 74.50%.
In order to study the effect of absorbing the neighboring regions' POIs on traffic flow prediction, the POI weighting factor κ in Eq (9) is set in turn to 0.9, 0.8, 0.7, 0.6, and 0.5 to check the variation in performance. The accuracies under different κ are shown in Figs 9 and 10 for outflow and inflow prediction, respectively, where the performance curves vary little under different κ. This indicates that the neighboring regions' POIs make little contribution to traffic flow prediction for the target region. This can be explained as follows: in practice, people get off only when the taxi arrives at their destination, and it rarely happens that people get off in a nearby region and reach the final destination on foot. Nevertheless, the contribution of the neighboring regions' POIs slightly improves prediction accuracy during different time intervals according to Figs 9 and 10, so we incorporate them into the POI feature when predicting traffic flows.
How the two parameters α and β for hot region detection affect the performance of traffic flow prediction is also examined. Here, the two parameters are set in turn to 1/2, 1/3, 1/4, 1/5, and 1/6. Note that smaller values of α and β correspond with fewer hot regions but higher degrees of hotness in terms of traffic flow, in accordance with Eqs (3) and (4). As illustrated in Figs 11 and 12, the prediction accuracy improves as the degree of hotness of the selected hot regions increases during the majority of the day.
Furthermore, the data for work days and weekends are used separately to evaluate prediction accuracy. First, the data corresponding with work days and weekends are divided into 4 parts at random. Then, 4-fold cross validation is used for performance evaluation. The average prediction accuracies are shown in Figs 13 and 14. We can see that the fluctuation trend of the prediction accuracy over time in Fig 13 is similar to that in Fig 14. The overall prediction accuracies for outflow and inflow in different cases are shown in Tables 2 and 3, respectively. From Tables 2 and 3, we can see that the average prediction accuracies for both work days and weekends are better than those obtained when mixing the two. The average prediction accuracies for each day of the week are given in Tables 4 and 5 to allow an insight into how generally the correlation between human mobility and region functions holds for each day of the week. As smaller values of α and β correspond with a higher degree of hotness, the average prediction accuracies of the different week days gradually improve as α and β decrease in Tables 4 and 5. In Table 4, although the average prediction accuracies of Tuesdays, Thursdays, and Fridays, namely 79.96%, 80.73%, and 80.40%, are a bit smaller than the work-day accuracy in Table 2, those of Mondays and Wednesdays, namely 81.34% and 82.43%, are clearly better. Meanwhile, the average prediction accuracies of Saturdays and Sundays, 83.46% and 80.61%, are much better than the 78.94% accuracy of the weekends in Table 2. The overall average prediction accuracy in Table 4 is 81.30%, which is almost 7 percentage points higher than the 74.49% precision in Table 2. In Table 5, only the average prediction accuracy of Fridays, 79.20%, is slightly smaller than the average prediction accuracy of the work days, 80.68%, in Table 3. The average prediction accuracies of the other week days are all 2%-4% higher than that in Table 3. The overall average prediction accuracy in Table 5 is 82.09%, which is almost 7 percentage points higher than the 75.15% precision in Table 3.

Note that in this study, we follow the practice of previous studies in the urban computing literature [3][4][18] and take taxi flows as representative of city-wide human mobility, since the sampling of city mobility by a large number of taxis can approach the trend of people's destinations in a statistical sense under the Law of Large Numbers.
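The 4-fold cross validation used for the work-day and weekend evaluations can be sketched as below. How the samples were actually shuffled and partitioned is not stated in the text, so the round-robin assignment and the seed are assumptions.

```python
import random


def k_fold_splits(items, k=4, seed=0):
    """Yield (train, test) lists for k-fold cross validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k roughly equal parts
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

Each sample appears in exactly one test fold, and the reported accuracy would be the average over the k rounds.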

Comparison
In the data preprocessing stage, the TF-IDF method is applied to the new region function descriptor. Experiments are conducted in this section to verify the effectiveness of the method. The average prediction accuracies of different time periods under different hot degrees with and without the TF-IDF method are shown in Fig 15. From Fig 15, we can see that all the prediction accuracies with TF-IDF preprocessing are greater than those without TF-IDF preprocessing, which supports the effectiveness of the method.
With historical outflows or inflows as inputs, the averaging method (AM) and the first-order exponential smoothing method (FOESM) are used to estimate $O_t^r$ and $D_t^r$, as defined in Eqs (15) and (16), respectively. In Eq (16), a larger value of μ means that $O_{t-1}^r$ and $D_{t-1}^r$ have a greater effect on the predicted value. We set $\hat{O}_0^r = O_0^r$ and $\hat{D}_0^r = D_0^r$ in the experiments. After that, the Mean Absolute Percentage Error as given in Eqs (13) and (14) is used to evaluate the prediction accuracy with 4-fold cross validation. In the experiments, it is found that setting μ to 0.5, 0.6, 0.5, 0.5, and 0.9 gives the best average prediction accuracy under different κ on the datasets of hot regions with (α, β) = (1/2, 1/2), (1/3, 1/3), (1/4, 1/4), (1/5, 1/5), and (1/6, 1/6), respectively. The best average prediction accuracies of inflow and outflow with the different methods are shown in Figs 16-20. From Figs 16-20, we can see that the average prediction accuracies of the proposed method are mostly higher than those of the other two methods. The only exception is the time period from 1:00 AM to 7:00 AM, as few people travel during that time, which makes the OD data ill-posed. After that time, increasingly many people travel between urban regions. The results shown in Figs 16-20 confirm that the proposed method offers better overall performance in terms of both prediction accuracy and stability compared with AM and FOESM.
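The two baselines can be sketched as below for the flow series of a single region. The smoothing recursion follows the standard first-order form, with μ weighting the previous observation as described above; taking AM as the running mean of all earlier observations is an assumption, since the exact averaging window is not recoverable from the text.

```python
def averaging_forecast(history):
    """Predict each period as the mean of all earlier observations (assumed AM)."""
    preds = [history[0]]  # no earlier data: reuse the first observation
    for t in range(1, len(history)):
        preds.append(sum(history[:t]) / t)
    return preds


def foesm_forecast(history, mu):
    """First-order exponential smoothing.

    pred[t] = mu * obs[t-1] + (1 - mu) * pred[t-1], with pred[0] = obs[0].
    """
    preds = [history[0]]
    for t in range(1, len(history)):
        preds.append(mu * history[t - 1] + (1.0 - mu) * preds[t - 1])
    return preds
```

With μ = 1 the smoothing forecast simply repeats the previous observation, while smaller μ gives more weight to the accumulated history.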
In addition to the above comparison, taking all 20-dimensional POIs as inputs, the Linear Regression (LR) model and the Support Vector Regression (SVR) model are used to predict the traffic flows of hot regions for comparison with Sparse Representation (SR), where the polynomial kernel is applied in SVR. Performance evaluation is conducted in the sense of Eqs (13) and (14). From Figs 21-25, we can see that although the LR model has the best stability, its prediction accuracies under different cases, around 0.6, are much lower than those of the proposed SR model, which are around 0.7 in different time periods. The prediction accuracies of the SVR model are much higher than those of the LR model in most time periods but still lower than those of the proposed SR model. The proposed SR model also has better stability than the SVR model under different cases.
In this study, the sparse representation method is used to select the POIs that are relevant to the traffic flows of hot regions, acting as feature selection. It is generally acknowledged that SVR is a much stronger prediction algorithm than the LR model, but its performance is not so good here. We attribute this to the lack of variable selection. Therefore, the SR model and the SVR model are combined to predict traffic flows. First, the SR model is used to select the POIs relevant to the traffic flows of hot regions. Then, the selected POIs are taken as inputs, and the SVR model is used to predict the traffic flows of hot regions. Following 4-fold cross validation, the average prediction accuracies of the different methods are shown in Tables 6 and 7.
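The selection step of the combined model can be sketched as keeping only the POI categories whose sparse weights are nonzero and passing the reduced feature matrix to the second-stage regressor. The tolerance and the function name are hypothetical, and the SVR stage itself is not shown.

```python
def select_features(X, weights, tol=1e-12):
    """Keep the columns of X whose sparse-representation weight is nonzero.

    Returns (kept_column_indices, reduced_matrix).
    """
    keep = [j for j, w in enumerate(weights) if abs(w) > tol]
    return keep, [[row[j] for j in keep] for row in X]
```

The reduced matrix would then serve as the input of the SVR model in place of the full 20-dimensional POI features.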
From Tables 6 and 7, we can see that the average prediction accuracies of the combined SR and SVR model for outflow and inflow are 76.60% and 77.38%, respectively, which are almost 10%-17% higher than those of the SVR model and 2% higher than those of the SR model. The above results show that variable selection as provided by the SR model is critical for understanding which POIs truly contribute to attracting traffic flows.

Conclusion
This study aims to find how urban planning affects human mobility. The discovered relation between city planning and human mobility is important for transportation and business intelligence: it can act as a guideline to design, rather than manage, traffic at the urban planning stage, while boosting commercial activity through region function design from an ecosystem point of view.
In this paper, we proposed a method that can predict the population outflows and inflows of a region from only the numbers and categories of its POIs. The contributions of this study lie in the following aspects: (1) As we see it, the best way to control traffic jams is to design traffic flows at the urban planning stage, since once city infrastructures and travel behaviors are fixed, we are limited in what we can do to alleviate traffic jams. Therefore, human mobility prediction from city functions is a novel and significant problem that has not attracted attention so far. This paper studies the possibility of predicting the traffic flows entering and leaving a region based on the city functions supplied by this region. To the best of our knowledge, this is the first endeavor to study predictive urban planning in terms of optimizing traffic flows. Meanwhile, human mobility is a major issue to be taken into account when planning commercial areas, and this study can provide a predictive view of how to attract visitors by designing region functions. (2) Unlike existing studies, this work aims to find the motivations that cause human mobility as well as the impact factor of every cause. Here, we propose applying sparse representation to a linear predictor to find the contribution of every category of city function in causing people to visit a place. By solving the predictor in the sense of sparse representation, we find that the average prediction precision is over 74% and that each type of POI contributes differently to the predictor, which accounts for the different reasons for, and degrees of, attracting people. Moreover, the discovered factors that have high impacts on people's mobility are valuable in that they can act as the input variables to a variety of predictors, making the prediction of human mobility from region functions possible.
As confirmed by the experiments, SVR model performs poorly without variable selection to identify the relevant factors affecting human mobility. In contrast, its performance can be greatly improved by applying the selected variables resulting from sparse representation.
The main lessons about human mobility learnt from this study are as follows: (1) The factors highly contributive to human mobility vary from time to time. During the morning rush hour, outflows are mainly from residential areas, restaurants, and hotels, while the destinations are mostly corporations, financial service agencies, and some well-known addresses. For lunch time and dinner time, notably, the main origin is corporation/business, and the main destination is restaurants. At night, the main traffic flows are from restaurants to residential areas and hotels. (2) Urban human mobility patterns in a week day could be different compared to the rest of the week days. The aforementioned findings should be meaningful in understanding urban human mobility patterns for urban planning, where the contribution of each type of city function in terms of attracting people to a region can be quantitatively identified such that the overall effect of region planning can be predicted.
This research is at a very early stage. Much future work remains to improve it by considering additional factors, such as weather, traffic control, and social events. Besides, more nonlinear predictors should be evaluated in comparison with the proposed method. Furthermore, data from different cities should be used to validate the proposed method. Taking climate into account, urban traffic is not expected to be the same in different months. In the follow-up study, we will collect more data for different months to validate the proposed model.