Vehicle trajectory prediction and generation using LSTM models and GANs

Vehicle trajectory prediction is a topic of growing interest in recent years, with applications in several domains ranging from autonomous driving to traffic congestion prediction and urban planning. Predicting trajectories starting from Floating Car Data (FCD) is a complex task that comes with different challenges, namely Vehicle to Infrastructure (V2I) interaction, Vehicle to Vehicle (V2V) interaction, multimodality, and generalizability. These challenges have not been fully explored by state-of-the-art works; multimodality and generalizability, in particular, have been neglected the most, and this work attempts to fill this gap by proposing and defining new datasets, metrics, and methods to help understand and predict vehicle trajectories. We propose and compare Deep Learning models based on Long Short-Term Memory and Generative Adversarial Network architectures; in particular, our GAN-3 model can be used to generate multiple predictions in multimodal scenarios. These approaches are evaluated with our newly proposed error metrics N-ADE and N-FDE, which normalize some biases in the standard Average Displacement Error (ADE) and Final Displacement Error (FDE) metrics. Experiments have been conducted using newly collected datasets in four large Italian cities (Rome, Milan, Naples, and Turin), considering different trajectory lengths to analyze error growth over a larger number of time-steps. The results show that, although LSTM-based models are superior in unimodal scenarios, generative models perform best in those where the effects of multimodality are higher. Space-time and geographical analyses are performed to demonstrate the suitability of the proposed methodology for real cases and management services.


Introduction
Sustainable transport is one of the explicitly mentioned good practices (https://sdgs.un.org/taxonomy/term/1198) for the achievement of the Sustainable Development Goals (SDG), mainstreamed across several goals and targets but with a specific link to SDG 11, "Make cities and human settlements inclusive, safe, resilient and sustainable". Sustainable transport is considered capable of "achiev[ing] better integration of the economy while respecting [...]". When multiple predictions can be equally plausible in a given scenario, there is no single correct solution for the deep model to find [10]. The focus of this paper is to address the problem of multimodality in FCD trajectory prediction. The main motivation behind this research is the lack of existing methods for approaching trajectory prediction in a multimodal fashion, which limits the scope of practical applications. To address this problem, we propose a generative method that can produce multiple plausible predictions. The evaluation of unimodal methods is also distorted by the biases in the standard metrics, so we introduce new ones that normalize those biases.
We train and evaluate predictive and generative models on 4 different datasets, pointing out the advantages and disadvantages of each method. FCD used for these analyses were acquired and pre-processed by the company VEM Solutions S.p.A. during an entire week (from 2018-10-05T22:00:00 to 2018-10-11T21:59:59 CET) in the four most populous Italian cities (Milan, Naples, Rome, and Turin). VEM on-board units (OBU) are mounted on fleets and private cars with a varying penetration rate, estimated between 2 and 8%.
While predictive methods aim to minimize error with respect to ground truth, generative ones aim to produce samples that are as realistic as possible. There are two important consequences. The first is that while predictive models have only one solution, generative ones can have several. The second is that predictive models tend to predict an "average behavior", while generative ones can reproduce real ones. This distinction is fundamental because of the multimodal nature of the problem. While it may be impossible to predict a trajectory in highly uncertain scenarios (i.e., the correct solution cannot be found), it is still possible to generate (i.e., enumerate) many plausible solutions that do not necessarily minimize prediction error. We associate the prediction problem with unimodal scenarios and the generation problem with multimodal scenarios.
In the evaluation of our experiments, we use standard metrics for trajectory prediction, i.e. the Average Displacement Error (ADE) and the Final Displacement Error (FDE), as well as new normalized metrics that reduce biases unaccounted for by the standard ones. These new metrics, i.e. the Normalized Average Displacement Error (N-ADE) and the Normalized Final Displacement Error (N-FDE), are one of the main contributions of this work, alongside the datasets and models. When evaluating our multimodal GAN-3 model, we take into account its ability to correctly reproduce a behavior rather than identify the exact one among many alternatives, which is not possible in truly multimodal scenarios.
The advantage of the proposed GAN-3 model over those used in other works is its ability to represent multiple distinct behaviors instead of producing a single average prediction; this makes the predictions more realistic and unaffected by the problem of multimodality. Similarly, the advantage of the proposed metrics is that they are unaffected by biases in the standard ones, which depend on properties of the trajectories and not on the predictions.

Challenges
Predicting vehicle trajectories is an arduous problem that requires addressing several challenges. The most relevant ones can be grouped into the following categories:
• Vehicle to Infrastructure (V2I) interaction. Vehicle trajectories depend on the environment, which can be the physical space (i.e. the infrastructure, or road network) or the temporal window in which they are collected.
• Vehicle to Vehicle (V2V) interaction. Vehicles interact with one another, either by limiting the possible trajectories of other vehicles, or by influencing their behavior as they must follow traffic rules.
• Multimodality. Trajectory prediction is highly uncertain because there are multiple possible behaviors and multiple correct solutions. Uncertainty has an aleatoric component, which cannot be reduced because it depends on the randomness of the system, and an epistemic part, which can be reduced by adding new information.
• Generalizability. A method should be evaluated on its ability to predict the entire distribution of possible vehicle trajectories, collected in a realistic environment and not in a controlled one.
Exploring all these challenges would require a lot of additional research. This paper focuses on the multimodality aspect, as we will show how predictions are influenced by high uncertainty. We will also partially tackle the problem of generalizability by proposing new realistic FCD datasets collected in four different cities, and showing how different datasets yield different results.

Contributions
The main contributions of this paper are the following:
• New datasets. We propose four new FCD datasets with a large number of vehicle trajectories. These datasets contain rich information about vehicles and road networks, but for our purposes we create a preprocessed version that only takes vehicle coordinates into account, splitting long trajectories into shorter ones that are more practical for real-time predictions.
• New metrics. We propose a normalized version of the standard ADE/FDE metrics to take into account different biases in the evaluation of the models in different scenarios.
• A generative model for multimodal scenarios. We propose a GAN model able to generate multiple predictions and overcome some of the challenges of purely predictive models (e.g. LSTM).
• The application of innovative data analysis and visualization techniques. We tested space-time pattern mining tools as support methods to better understand spatiotemporal trends and to generate analysis outputs suitable to support UTC actions.
The paper is organized as follows. Section 2 provides a description of the approaches adopted when facing vehicle trajectories. Section 3 thoroughly describes our contributions, i.e. datasets, metrics, and the GAN-3 model. In Section 4, we evaluate and compare predictive and generative models on our proposed datasets with the standard metrics and our newly proposed ones. Finally, in Section 5, conclusions and discussion about future directions for this field of research are provided.

Related works
The use of FCD for traffic management and control, along with its benefits and limitations, has been widely discussed, generally mentioning the insufficient penetration rate as the main limiting factor [11]. Houbraken et al. [12] review different studies related to the necessary FCD penetration rates and find that this value depends on the intended application and data sample frequency. Altintasi et al. in [3] find that penetration rates of around 3% are sufficient to identify some critical urban traffic states.
Ajmar et al. in [13] compare floating car data (FCD) with low penetration rates with data acquired by fixed sensors. Currently, most models used by UTC are based on traffic waves and on data acquired by fixed sensors that have several issues: they record data at a specific location and, due to their installation and maintenance costs, they are not sufficiently distributed, with a reduced or nonexistent coverage especially on minor roads. This leads to the necessity of building dedicated models to estimate the traffic states at other locations and therefore of dealing with related uncertainties.
Guo et al. in [1] provide a systematic review of CAV-based UTC studies. They state that the availability of high-resolution trajectory data of individual vehicles could increase the understanding of traffic flow states, which are critical to traffic control, allowing the estimation not only of flow volumes but also of travel time, queue length, and shockwave boundaries, which cannot be calculated from fixed-sensor data alone.
Many works use FCD with the aim to detect and analyze traffic in an urban environment. Chen et al. in their recent work [14] use the speed performance index (SPI) to evaluate the traffic congestion based on the speed of the traffic flow and the speed limits of the road. They propose a categorization criterion to make consistent the classification results of SPI with the five levels standard of traffic congestion classification. The investigation has the aim to detect the paths of traffic congestion, finding both spatial and temporal correlations.
Similarly, Altintasi et al. in [3] use the FCD with the information on speed to detect the traffic status. They process the raw FCD to obtain 4 Levels of Service (LOS) that identify the traffic state. Each LOS corresponds to a different range of speed and traffic situation. With the experimental phase in a real urban environment, by observing the pattern variations, they are able to detect how traffic changes through time in a given location.
Sunderrajan et al. in [15] also estimate traffic flow starting from FCD. The authors consider not only traffic speed but also other aggregate macroscopic quantities such as density and flow. This was the first attempt to estimate both traffic flow and density from FCD. Moreover, an interesting novelty is the introduction of a method to determine the minimum rate of probe vehicles needed to reconstruct the traffic state in different conditions. However, the papers presented in this section so far do not make traffic predictions from FCD, but rather detect and recognize situations of traffic congestion. A pioneering work in this direction is the one proposed by De Fabritiis et al. in [16]. This work exploits real-time FCD based on GPS positions to create a system for traffic estimation and prediction. They propose a system that extracts speed information, producing an estimation of traffic conditions. Moreover, using two different approaches based on pattern matching and artificial neural networks, they are able to predict short-term speed traffic conditions, with a maximum prediction time of 30 minutes.
More recent is the work of Kong et al. [17], which exploits trajectory data to predict traffic congestion, indicating an alternative path in real time. The parameters used to represent the traffic flow are traffic volume and traffic speed. The idea is to use a fuzzy comprehensive evaluation method to determine the traffic flow and predict traffic congestion, where the weights of the indexes are assigned according to the traffic flow. These weights are determined by the Particle Swarm Optimization (PSO) algorithm, and then a congestion state fuzzy division module is applied to convert the predicted flow parameters into citizens' cognitive congestion state. The metrics used to evaluate the performance of the proposed approach are accuracy, instantaneity, and stability.
A method based on PSO is also proposed by Luo et al. in [18]. The authors present a hybrid architecture for short-term traffic flow prediction by combining PSO and a genetic algorithm. This hybrid approach is applied to the Least Square Support Vector Machine (LSSVM) to select the appropriate parameters to predict short-term traffic flow.
More recently, the most widely used approaches for traffic flow prediction are based on deep learning algorithms [19]. The most common deep learning method for traffic prediction is the Long Short-Term Memory (LSTM) network, as in [20,21]. Vazquez et al. in [22] compared four deep learning models (LSTM, GRU, SRCN, HGC-LSTM) for urban traffic prediction. The dataset they used contains the average speeds of the tracked vehicles in different periods. Given the state of the network in a time period, the models have to predict the state of the network in the following one. The metrics used for the comparison are the Mean Absolute Error (MAE) and the Root Mean Square Error (RMSE). Other works in vehicle trajectory prediction, like the DESIRE framework proposed by Lee et al. in [23], experiment on public datasets like KITTI [24] and the Stanford Drone Dataset [25], which provide visual information instead of FCD.
In [26], the authors introduce and compare two different machine learning methods that use floating car trajectory data to predict the occurrences of crashes on urban expressways: a binary logistic regression model and a Support Vector Machine (SVM) model. According to the authors, the latter performs greatly for predicting crashes on urban expressways. Implementations of such models for monitoring real-time conditions seem promising in helping to detect and prevent potential crashes from happening.
Finally, some works from the domain of human trajectory prediction propose interesting methods that could be extended to the domain of vehicle trajectory prediction. For example, Alahi et al. in [27] propose the Social LSTM, a network that predicts human trajectories taking into account interactions with other human trajectories. This is the equivalent of V2V interaction for vehicles. Gupta et al. in [28] extend this work and propose the Social GAN, that generates multiple predictions in a multimodal context, similarly to our work.

Methodology
Among the four challenges outlined in the introduction, we focus on multimodality and partially on generalizability. In this Section, we provide a formal definition of the trajectory prediction problem (Section 3.1), an analysis of the proposed datasets (Section 3.2), a description of the proposed normalized metrics (Section 3.3), and the implementation details of our GAN-3 model (Section 3.4).

Problem definition
A trajectory is the route taken by a vehicle over a set period of time. Unless otherwise stated, we define it as a vector of numerical values that alternatingly correspond to the X_i and Y_i coordinates at increasing timestamps i, separated by a fixed time-step. A trajectory of n points is a sequence of 2n values X_1, Y_1, X_2, Y_2, ..., X_n, Y_n, or an element of the set T as defined in (1).
The trajectory point index is mapped to a timestamp by the function time(x), which respects the condition in Eq (2), with K being a fixed time-step.
The trajectory prediction problem is defined as the task of estimating the values of future time-steps from the observed values of past time-steps by minimizing a loss function based on the distance from the ground truth. We define it as the following sequence generation task: given an observed trajectory of m < n points as a sequence X_1, Y_1, ..., X_m, Y_m, estimate the remaining points X_{m+1}, Y_{m+1}, ..., X_n, Y_n.
The generation problem is defined as the task of synthesizing samples that can realistically fit into the original data distribution. The trajectory generation problem is similar to the prediction one, but instead of minimizing some error metric, the goal is to find some t_i ∈ T that realistically describe a vehicle trajectory.
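As a concrete illustration of this representation, the following sketch splits a flat trajectory vector into its observed and future parts (the values and variable names are placeholders, not the paper's actual data):

```python
import numpy as np

# A trajectory of n points as a flat vector [X1, Y1, ..., Xn, Yn];
# values here are placeholders standing in for GNSS coordinates.
n, m = 12, 8                        # total points, observed points (short-trajectory setting)
t = np.arange(2 * n, dtype=float)   # 2n values

observed = t[: 2 * m]   # X1, Y1, ..., Xm, Ym  -- the model's input
future = t[2 * m :]     # Xm+1, Ym+1, ..., Xn, Yn  -- the target to estimate

# Reshape so each row is one (X, Y) pair at a single time-step.
points = t.reshape(n, 2)
```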
It is important to clarify that the problem of trajectory prediction is different whether we consider short trajectories (a few seconds long) or long ones (minutes or hours long). These are two different problems with different challenges and applications.
Since we address the problem of multimodality, the trajectories we analyze are a few minutes long: long enough so there are different possible routes in order to make a multimodal analysis, short enough so that this analysis can be applied to contexts such as self-driving cars and traffic congestion prediction. The longer a trajectory is, the more difficult it is to predict, and it requires a different kind of analysis.
In the rest of this paper, the terms "short trajectory" and "long trajectory" will have a precise meaning, referring to the exact lengths of the trajectories we analyze: 12 minutes for short trajectories and 20 minutes for long ones. In both cases, the observed part of the trajectory has a fixed length of 8 minutes. We expect the prediction error to be larger for long trajectories.

Data processing
For our analysis, we introduce new datasets obtained by preprocessing the GNSS data of vehicle trajectories from 4 different cities, already described in [13]: Rome, Turin, Milan, and Naples.
These datasets contain a vast amount of data, both in the form of data points and features. They provide useful information about the type of vehicle and the road graph, but for our purposes we discard most of it and only keep coordinates and timestamps. We apply some preprocessing steps to split macro-trajectories into shorter ones with a fixed number of data points and to reduce noise. Table 1 reports the number of data points for each dataset, before and after preprocessing, as well as the number of resulting trajectories. The main preprocessing steps are the following:
1. Resampling: trajectories (lists of coordinates associated with the same vehicle) are resampled with a 60000 ms period, as the original sampling is irregular (with a mean of 60000 ms). This makes all the trajectories uniform time-wise and partially reduces noise.
2. Filtering: idle and noisy trajectories are removed. We consider a trajectory noisy if it contains at least two consecutive coordinates with a difference greater than 0.1 on values normalized in the range [0, 1]. We consider a trajectory idle if its variance is lower than a fixed threshold. An idle trajectory must be completely idle: for example, trajectories of vehicles that start moving after having been idle for just a few time-steps are not considered idle. Removing idle trajectories is necessary because they decrease the value of error metrics without reflecting the ability of a model to predict real, non-idle trajectories.
Following these steps, we obtain two different preprocessed datasets: one with short trajectories and another with long ones. The former has 12-point trajectories (8 observed points and 4 predicted points), while the latter has 20-point trajectories (8 observed points and 12 predicted points).
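The resampling and filtering steps above can be sketched as follows for a single vehicle's trajectory; the idle-variance threshold and the per-trajectory normalization are assumptions made for illustration, as the exact values are defined elsewhere in the paper:

```python
import numpy as np
import pandas as pd

def preprocess_trajectory(df, period="60s", jump_thr=0.1, idle_var=1e-6):
    """Resample one vehicle's trajectory and flag it as noisy or idle.
    `idle_var` is a placeholder threshold, not the paper's actual value."""
    # Resampling: regularize the irregular sampling to a 60 s period.
    df = df.set_index("timestamp").resample(period).mean().interpolate()
    # Normalize coordinates to [0, 1] before the noise/idle checks
    # (per-trajectory normalization is an illustrative assumption).
    span = df[["x", "y"]].max() - df[["x", "y"]].min() + 1e-12
    xy = (df[["x", "y"]] - df[["x", "y"]].min()) / span
    # Filtering: noisy if any consecutive jump exceeds `jump_thr`,
    # idle if the whole trajectory barely moves.
    noisy = bool((xy.diff().abs() > jump_thr).to_numpy().any())
    idle = bool(xy.var().sum() < idle_var)
    return df, noisy, idle

# A small synthetic trajectory: 20 points, one per minute, moving steadily.
ts = pd.date_range("2018-10-05 22:00", periods=20, freq="60s")
raw = pd.DataFrame({"timestamp": ts,
                    "x": np.linspace(0.0, 1.0, 20),
                    "y": np.linspace(0.0, 1.0, 20)})
resampled, noisy, idle = preprocess_trajectory(raw)
```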

Evaluation metrics
The standard metrics used in the field of trajectory prediction are the Average Displacement Error (ADE) and the Final Displacement Error (FDE) proposed by Pellegrini et al. [29]. ADE measures the RMSE between each pair of predicted and true trajectory points, while FDE only considers the difference between the predicted and true final points.
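Using the common formulation of these metrics (mean Euclidean displacement over the predicted points for ADE, the last point's displacement for FDE), a minimal implementation looks like this:

```python
import numpy as np

def ade(pred, true):
    """Average Displacement Error: mean Euclidean distance over all
    predicted points. `pred` and `true` have shape (n_points, 2)."""
    return float(np.mean(np.linalg.norm(pred - true, axis=1)))

def fde(pred, true):
    """Final Displacement Error: distance between the final points only."""
    return float(np.linalg.norm(pred[-1] - true[-1]))

# Example: every predicted point is offset by (0.3, 0.4), i.e. 0.5 units.
true = np.zeros((4, 2))
pred = np.tile([0.3, 0.4], (4, 1))
```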
Although these metrics are straightforward to implement and interpret, they have some limitations when evaluating methods used in real-world scenarios. In controlled scenarios, trajectories tend to have similar properties (e.g., same length), but in real-world scenarios, these properties can greatly affect the evaluation. For example, the value of ADE tends to be higher for longer trajectories (with many data points) because farther time-steps are more unpredictable and thus the average error tends to be larger. This problem can't be solved by normalizing only once (i.e., dividing the total error by the number of time-steps), as the error doesn't grow linearly.
For example, if trajectory T 1 is longer than trajectory T 2 , the average error of T 1 will be higher than the average error of T 2 because the higher uncertainty of the farthest time-steps increases the average error.
The use of ADE/FDE would generate a bias against longer trajectories and favor models that learn short trajectories well. These limitations justify the use of the new metrics defined in this work; however, ADE/FDE are still valuable and intuitive, so we use them jointly with our proposed normalized metrics N-ADE/N-FDE. In fact, N-ADE/N-FDE are not a replacement for ADE/FDE. The advantage of the normalized metrics is that they are mostly independent of trajectory properties, the disadvantage is that they can only be used in an aggregate manner since they don't provide useful information if evaluated on a single trajectory.
Many alternative ways to measure the similarity between two trajectories have been proposed over the years, and several interesting approaches, like those based on Edit Distance or Dynamic Time Warping, have been reviewed by Magdy et al. [30]. While most of these approaches address the limitations of ADE/FDE, they are more computationally expensive and lose physical meaning (i.e., they don't have a physical unit of measurement unit, such as meters). A good alternative to ADE/FDE should not only take into account the biases previously mentioned, but it should also be computationally efficient while maintaining physical meaning.
N-ADE/N-FDE are defined by normalizing ADE/FDE with respect to the square root of selected trajectory properties multiplied by constants. The square root attenuates the effect of the normalization, since the metrics should not become completely independent of the trajectories; the constants are only used to maintain an intuitive physical meaning.
We define the aggregates Normalized Average Displacement Error (N-ADE) and Normalized Final Displacement Error (N-FDE) as in Formulae (3) and (4).
The normalized metrics are calculated using the parameter P, which is the set of chosen properties to normalize. ADE/FDE are multiplied by the Normalization Factor (NF) defined in Formula (5). ADE(t), FDE(t), and NF(t) are functions of the trajectory t. The NF is calculated as the inverse of the product of the chosen properties p(t), multiplied by the constants K p .
For our purposes, we choose the length (in meters) and the variance of the trajectory as properties to normalize, but future works could use different properties. This choice comes from the assumption that long trajectories are more difficult to predict and they lead to higher errors, as well as high-variance (noisy) ones. We redefine our NF as in Formula (6).
As for the constants, we chose the median values of the datasets, truncating the length to the integer and the variance to the first decimal, as in Formulae (7) and (8) (it is worth remembering that these constants are not strictly necessary and are arbitrarily chosen; we only use them to make the metrics more intuitive).
The aggregate N-ADE and N-FDE of an entire dataset are calculated as the median value, which is a better choice than the mean because it is more robust to very low and very high values, which are common with N-ADE/N-FDE defined using the chosen properties. The error can approach infinity if these property values are too low, and the average error of a whole dataset can be enormous even if there is only one faulty trajectory. This is not an issue when using the median aggregator.
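Since Formulae (3)-(8) are not reproduced here, the sketch below implements one plausible reading of the description: each property (length, variance) is scaled by its dataset-median constant, the product is square-rooted to attenuate the normalization, and the per-trajectory values are aggregated with the median. The exact form of the paper's NF may differ:

```python
import numpy as np

def traj_len(t):
    """Trajectory length: summed point-to-point distance, t of shape (n, 2)."""
    return float(np.sum(np.linalg.norm(np.diff(t, axis=0), axis=1)))

def n_ade(ade_value, t, k_len, k_var):
    """Normalized ADE under one plausible reading of the paper's NF:
    scale each property by its constant (k_len, k_var: dataset medians
    of length and variance), take the square root, and invert."""
    nf = 1.0 / np.sqrt((traj_len(t) / k_len) * (float(t.var()) / k_var))
    return ade_value * nf

def aggregate(values):
    """Dataset-level aggregation with the median, which is robust to the
    extreme values NF can produce on near-degenerate trajectories."""
    return float(np.median(values))

# A trajectory whose properties equal the constants leaves the error unchanged.
t = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
unchanged = n_ade(2.0, t, traj_len(t), float(t.var()))
```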

Implementation details
We approach the problem of vehicle trajectory prediction with two different methods that we describe here and compare in Section 4.
The first method is purely predictive and is based on an LSTM architecture. This method tries to predict the future positions of the vehicles by minimizing the Mean Squared Error loss function.
The second method has a generative nature and is based on a GAN architecture. This method tries to generate realistic trajectories that can be used to make multiple predictions. This is necessary for multimodal scenarios, where multiple correct solutions are possible. With the GAN-3 model, we generate 3 different trajectories, each representing a different behavior, then pick the correct behavior and measure the error w.r.t. that behavior.
Each of the two methods is more suitable for particular scenarios (namely, LSTM for unimodal scenarios and GAN-3 for multimodal ones). We then compare the two approaches in Section 4, where we also compare GAN-3 with its unimodal counterpart, GAN-1, which is expected to perform worse than the purely predictive LSTM.

Predictive model architecture.
The chosen predictive architecture is based on the Long Short-Term Memory (LSTM) network. This model is a Recurrent Neural Network (RNN) and is best suited for problems that require learning patterns in a time series (a sequence of values corresponding to increasing timestamps with a fixed period or time-step); for this reason, it is also the standard choice in other works in this field. An LSTM network contains one or more LSTM layers that take as input a time series that describes the past and produce as output another time series that describes its future prediction. The input of those layers needs to be reshaped in such a way that one dimension represents the timesteps and another dimension represents the value of the data at each timestep. Our LSTM is stateless, which means that no memory is retained when training on different trajectories, as they do not belong to the same time series. Optimal results are achieved quickly with minimal tuning considering that the training data has a very simple representation (a sequence of 16 numbers).
The architecture of the LSTM model is very simple: it consists of two stateless LSTM layers with 32 and 16 neurons respectively, followed by a Dense layer with 16 neurons and a final Dense layer with a number of neurons corresponding to the number of timesteps to predict (that can be 4 or 12) multiplied by the number of coordinates (2). The first layer receives an input of 16 values as the 8 (x, y) ordered coordinates in the observed sequence. A simple visual representation is shown in Fig 1: the observed trajectory is given as input to the Input layer, then it's passed to the Reshape layer that organizes the input in a way that groups and separates the timesteps by the x and y coordinates. The resulting data is fed to two consecutive LSTM layers. Finally, we have a fully connected Dense layer followed by the output, i.e. the predicted trajectory as another Dense layer. We train our model with a batch size of 256. We use the Rectified Linear Units (ReLU) activation function. The loss function to minimize is the Mean Squared Error.
We use the Adam optimizer with an initial learning rate of 0.001. Regularization is achieved with Batch Normalization and Dropout layers set with a probability of 0.2. We do not limit the number of epochs since the models converge fast, instead, we set a patience of 200, which means that the training stops if no improvements have been made over the last 200 epochs.
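The predictive model described above can be sketched in Keras as follows; the exact placement of the Batch Normalization and Dropout layers is an assumption, as it is not specified layer by layer:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_lstm(n_obs=8, n_pred=4):
    """Stateless LSTM predictor sketched from the description:
    Input(16) -> Reshape -> LSTM(32) -> LSTM(16) -> Dense(16) -> Dense(n_pred * 2).
    BatchNorm/Dropout placement is assumed, not specified in the text."""
    inp = layers.Input(shape=(n_obs * 2,))          # 16 flat values
    x = layers.Reshape((n_obs, 2))(inp)             # (time-steps, coordinates)
    x = layers.LSTM(32, return_sequences=True)(x)
    x = layers.LSTM(16)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.2)(x)
    x = layers.Dense(16, activation="relu")(x)
    out = layers.Dense(n_pred * 2)(x)               # 4 or 12 points, 2 coords each
    model = tf.keras.Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="mse")
    return model

model = build_lstm()
```

Early stopping with a patience of 200 epochs, as described, would be added at fit time with `tf.keras.callbacks.EarlyStopping(patience=200)`.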
Each of the 4 datasets (Rome, Turin, Milan, and Naples) is shuffled and split into a training set (56%), a validation set (14%), and a test set (30%). Partial experiments with cross-validation did not result in significant improvements, so we decided to use a fixed validation set to maximize training speed.

Generative model architecture.
We use a Generative Adversarial Network (GAN) to generate plausible trajectories and thus produce multiple predictions in a multimodal environment. The GAN model, introduced by Goodfellow et al. in [8], consists of two networks, a Generator and a Discriminator. The Discriminator learns to distinguish between real (i.e., fitting into the original data distribution) and fake data, while the Generator aims to fool the Discriminator by producing realistic data. If the model is trained well, the Generator will be able to produce new data samples that fit into the training data distribution; in our case, the predicted trajectories.
We distinguish between two generative methods, GAN-1 and GAN-3. For simplicity, we will refer to GAN-1 and GAN-3 as two different models, although they are actually the same model and are trained in the same way; the only difference lies in the testing phase.
Both the Generator and the Discriminator receive the observed trajectory as input. The Generator also takes a latent vector as input and produces the output trajectory. The Discriminator also receives the generated trajectory as input and decides whether the entire trajectory is real or fake. A simple visual representation of the GAN model is shown in Fig 2: two Input layers take the latent vector and the observed trajectory that are concatenated and fed to the Generator layer. The Generator layer produces an output that is concatenated with the observed trajectory input to produce the input for the Discriminator layer. The Generator and Discriminator layers are exploded respectively in Fig 3a and 3b.
We use LeakyReLU instead of ReLU as the activation function of both the Generator and Discriminator layers. LeakyReLU allows a small positive gradient when the neuron is not active, which helps mitigate the vanishing gradient problem, particularly common in GAN models. In general, high regularization is needed when designing a GAN Discriminator to avoid it overpowering the Generator.
The Discriminator has to learn to distinguish real trajectories from fake ones. By contrast, the Generator has to learn to produce plausible trajectories. A simple visual representation of both models is shown in Fig 3. To understand the architecture of the Discriminator, we can divide it into three sub-networks. The first sub-network takes as input only the trajectory produced by the Generator, and it learns some features without taking the observed trajectory into account. This sub-network learns whether the Generator can produce meaningful trajectories, regardless of whether they are meaningful predictions for the observed ones. It contains an LSTM layer with 64 neurons (32 for 4-point predictions) followed by a Dense layer with 32 neurons (16 for 4-point predictions).
The second sub-network learns to model the entire trajectory. It learns if the trajectory produced by the Generator is a good prediction for the input trajectory. Its input is the concatenation between the observed trajectory and the trajectory produced by the Generator. It contains an LSTM layer with 64 neurons followed by a Dense layer with 32 neurons.
The third sub-network takes as input the concatenated outputs of the two other sub-networks and produces the final output (real or fake). It contains 3 Dense layers with 32, 16, and 1 neuron each. The activation function of the final Dense layer is the sigmoid.
Similarly, to understand the architecture of the Generator, we can divide it into two subnetworks. The first sub-network takes as input the observed trajectory and learns some basic features. This sub-network doesn't consider the latent input used for generating new samples, meaning that the output of this sub-network will always be the same for a given observed trajectory. It contains an LSTM layer with 32 neurons, followed by another LSTM layer with 16 neurons, followed by a Dense layer with 16 neurons.
The second sub-network takes as input the output of the previous one, concatenated with the latent input. The latent input allows for a mapping between the latent space and the output space, meaning that multiple predictions can be made. The latent input is a vector with 2 values, as the low dimensionality helps prevent the latent input from overshadowing the output of the first sub-network and mitigates the risk of mode collapse. The sub-network contains a Dense layer with 16 neurons, followed by a final Dense layer with as many neurons as the number of timesteps to predict (4 or 12) multiplied by the number of coordinates (2). The activation function of the final Dense layer is the hyperbolic tangent.
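The role of the latent input can be illustrated with a toy stand-in for this second sub-network: the weights below are random placeholders (not trained values) and the computation is plain numpy rather than the actual Keras layers, but it shows how a fixed feature vector combined with different latent vectors yields different bounded trajectories.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random placeholder weights: Dense(16) on [features, z], then a
# Dense(timesteps * coords) output layer with tanh activation.
W1 = rng.standard_normal((16 + 2, 16)) * 0.5
W2 = rng.standard_normal((16, 4 * 2)) * 0.5

def generate(obs_features, z):
    """Map observed-trajectory features plus a 2-value latent vector
    to a 4-point trajectory of (x, y) coordinates."""
    h = np.concatenate([obs_features, z]) @ W1
    h = np.where(h > 0, h, 0.01 * h)          # LeakyReLU
    return np.tanh(h @ W2).reshape(4, 2)      # tanh keeps outputs in (-1, 1)

obs_features = rng.standard_normal(16)        # fixed for a given observed trajectory
traj_a = generate(obs_features, rng.standard_normal(2))
traj_b = generate(obs_features, rng.standard_normal(2))
```

With the observed features held fixed, sampling two different latent vectors produces two different plausible trajectories, which is exactly what makes multimodal generation possible.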
The Generator produces the non-observed part of the trajectory, which is concatenated to the observed part and passed as input to the Discriminator.
When training the GAN, each epoch consists of a Generator step and a Discriminator step. When training the Generator, we feed the input to the entire GAN model while setting the Discriminator weights as not trainable. Both training steps produce the same type of output, i.e. the predicted "real" or "fake" class, but the loss function is the opposite, as the Generator aims to fool the Discriminator into classifying all its samples as "real".
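The "same output, opposite loss" mechanic can be shown with binary cross-entropy in isolation (a generic sketch, not the paper's training code; the scores below are made-up Discriminator outputs for a batch of generated trajectories):

```python
import numpy as np

def bce(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy, the usual GAN classification loss."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(y_pred)
                           + (1 - y_true) * np.log(1 - y_pred))))

# Hypothetical Discriminator scores for three generated ("fake") trajectories.
d_on_fake = np.array([0.2, 0.3, 0.1])

d_loss = bce(np.zeros(3), d_on_fake)  # Discriminator step: fakes should score 0
g_loss = bce(np.ones(3), d_on_fake)   # Generator step: same scores, target flipped to 1
```

The two steps consume the same Discriminator output; only the target label is flipped, so a confident Discriminator (low `d_loss`) automatically means a large Generator loss, and vice versa.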
The difference between using an LSTM and a GAN for prediction lies in the objective function. The LSTM learns an average behavior because it minimizes the average error between all the predictions. The GAN learns to produce realistic samples corresponding to specific behaviors. This is why using the GAN-1 model (that only makes one prediction) in a multimodal scenario is counterproductive; even if the model is good at generating plausible trajectories, it can produce a different behavior from the real one, resulting in high errors.

Multimodal generation.
One example of multimodal generation in trajectory prediction is the SGAN-20 model by Gupta et al. [28] in the domain of human trajectory prediction, which uses the Social GAN to produce 20 different samples and selects the best one by comparing the errors with the ground truth. One major problem with this approach is that it cannot be compared with unimodal predictions, because part of the reason why SGAN-20 outperforms the unimodal SGAN is simply that it makes more attempts.
This problem cannot be completely avoided, since a better approach to addressing multimodality does not yet exist, but our proposed GAN-3 partially reduces the impact of chance on the error.
The problem with methods like SGAN-20 is that we cannot be sure that the 20 samples describe 20 different behaviors. SGAN-20 assumes that the behaviors are all different and exactly one of them is correct. This is seldom true and the results are unreliable.
GAN-3 is similar to SGAN-20 and it too picks the best result w.r.t. the ground truth, but we implement two major differences. First, we reduce the number of samples. Assuming that each sample should represent a different behavior, a number of 20 possible behaviors is excessive for trajectory prediction. Furthermore, it depends on what is considered to be a different behavior. For example, potential behaviors could include turning left, turning right, turning left and stopping, or turning left and accelerating.
The second difference is that we compare the generated samples to determine whether they represent different behaviors. The problem is that there are different levels of abstraction when considering different behaviors, so we set a heuristic to quantify the level of abstraction we want to use in our evaluation.
We approximate the distance between two behaviors and use it to define a threshold T ∈ (0, 1) that sets the level of abstraction at which two behaviors are considered different. With a high T there are just a few possible behaviors (perhaps with a simpler semantic representation), while with a low T there are many possible ones (with a more complex semantic representation). If T is too high, the problem becomes unimodal, so using an LSTM would be more convenient. If T is too low, every generated trajectory would count as a different behavior, and the method would reduce to SGAN-20.
In general, we define our GAN-K model as a model with the architecture previously defined that, when evaluated, produces K samples. Samples are produced sequentially. When sample i is generated, it is compared with the previous i − 1 samples by checking its distance from those behaviors. If the distance is too low for at least one behavior, the sample is discarded and generated again. After a given number of attempts, which we set to 100, we stop the generation and keep only the first i − 1 samples.
The distance between two behaviors is calculated as the ratio between the lower of the two errors and the higher one. If this value is higher than 1 − T, the new sample is discarded. The errors we compare are the values of the N-ADE metric. In mathematical terms, the multimodal GAN generates K samples that satisfy the condition in Formula (9), given a threshold T and the N-ADE error E_i for sample i. Our heuristic assumes that samples corresponding to the same behavior have similar errors (a ratio close to 1), although the opposite is not necessarily true. In our case, with K = 3, we choose the threshold T = 50%.
min(E_i, E_j) / max(E_i, E_j) ≤ 1 − T,  ∀ i, j ∈ {1, ..., K}, i ≠ j  (9)

To clarify, GAN-3 cannot be used for unimodal prediction, as in real-world scenarios we do not have access to the ground truth. We use this method to analyze multimodality in a given environment. Moreover, GAN-3 is a stochastic method that can potentially generate different sets of samples that satisfy the condition in Formula (9). Formula (9) uses the ground truth to calculate the error (similarly to SGAN-20 [28]), which is not available for unimodal prediction purposes.
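The sequential generation with the Formula (9) check can be sketched as follows. This is a minimal illustration: `generate` is a stand-in sampler and `nade` uses a plain mean displacement instead of the paper's normalized N-ADE, so the sketch stays self-contained.

```python
import random

def nade(sample, ground_truth):
    """Stand-in for N-ADE: mean absolute displacement (the paper's
    metric adds normalization factors not reproduced here)."""
    return sum(abs(p - q) for p, q in zip(sample, ground_truth)) / len(sample)

def behaviors_differ(e_i, e_j, T):
    """Formula (9): keep two samples only if the ratio between the
    lower and the higher error does not exceed 1 - T."""
    lo, hi = sorted((e_i, e_j))
    if hi == 0:
        return False  # identical (zero) errors: treat as the same behavior
    return lo / hi <= 1 - T

def gan_k(generate, ground_truth, K=3, T=0.5, max_attempts=100):
    kept, errors = [], []
    for _ in range(K):
        for _ in range(max_attempts):
            s = generate()
            e = nade(s, ground_truth)
            if all(behaviors_differ(e, e_j, T) for e_j in errors):
                kept.append(s)
                errors.append(e)
                break
        else:
            break  # attempts exhausted: keep only the first i - 1 samples
    return kept, errors
```

The first sample is always accepted (there is nothing to compare it against); each later sample is regenerated, up to 100 times, until it is sufficiently far from every previously kept behavior.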

Results and discussion
In this Section, we evaluate our predictive and generative models on the four datasets we proposed. We expect GAN-1 to be outperformed by both LSTM and GAN-3: the former should predict an average behavior that is more accurate than a random behavior, while the latter should find the correct behavior. GAN-1 is evaluated for the sole purpose of providing baseline results for generative models.
The models are evaluated using both the ADE/FDE and N-ADE/N-FDE metrics on both 4-point and 12-point predictions.
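For reference, the standard ADE/FDE definitions on a single trajectory can be computed as follows (the N-ADE/N-FDE normalization factors are defined in the paper's methods and are not reproduced in this sketch):

```python
import numpy as np

def ade_fde(pred, truth):
    """ADE averages the Euclidean displacement over all predicted points;
    FDE is the displacement at the final point only.
    Both inputs have shape (n_timesteps, 2)."""
    d = np.linalg.norm(pred - truth, axis=1)  # per-timestep displacement
    return float(d.mean()), float(d[-1])

truth = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
pred  = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [3.0, 2.0]])
ade, fde = ade_fde(pred, truth)  # ade = (0 + 1 + 0 + 2) / 4 = 0.75, fde = 2.0
```

Because FDE looks only at the last point, it accumulates all the drift over the trajectory, which is why the gap between ADE and FDE grows with the number of predicted time-steps.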
Related works do not take multimodality into account (i.e. they do not produce multiple predictions) and use biased metrics; for this reason, a direct quantitative comparison between those approaches and ours does not provide a meaningful interpretation. In fact, in [31], the authors introduced a path inference method for low-frequency FCD. They have implemented and compared three methods: WSPA [32], ST-matching [33], and EMM [34]. However, the proposed method is specifically designed for cases with minimum information (i.e. latitude, longitude, and timestamp).
Altinasi et al. [3] have also studied FCD, but their paper aimed at detecting critical patterns in urban roads. Traffic patterns were defined by considering the Level of Service, which is based on segment speed on urban arterial roads. Attempts at predicting trajectory data for estimating traffic conditions from large historical FCD are made in [35]. The goal was twofold: on the one hand, the estimation of congestion zones on a large road network; on the other, the estimation of travel times within congestion zones through time-varying Travel Time Indexes (TTIs). Recently, in [36], the authors proposed a trajectory restoration algorithm based on geometry map matching algorithms. Nevertheless, even in this case, these approaches do not consider multimodality.
For the aforementioned reason, we compare our GAN-3 model with an LSTM model, since the latter is the type generally used by related works. The advantage of using our GAN-3 model over those works can be assessed by comparing it with our LSTM model, using both standard metrics and normalized ones.
In Section 4.1, we show the results of these experiments, while in Section 4.2, we provide an analysis of the traffic flow, with the purpose of mapping those results into the physical world.

Experiments with LSTM and GAN
In the following tables, each column refers to a dataset, so that in a given column (e.g. the dataset Rome), we have the values of the error metrics for the models trained and evaluated on that given dataset.
In Tables 2 and 3 we show the values of ADE/FDE and N-ADE/N-FDE, respectively. We repeated the experiments for the three models (LSTM, GAN-1, GAN-3) and for both short predictions (4 points) and long predictions (12 points).
As expected, GAN-1 always performs the worst, because it neither minimizes the error by picking an "average behavior" as LSTM does, nor picks the correct behavior as GAN-3 usually does.
In most cases, LSTM outperforms GAN-3, meaning that the latter is not a replacement for the former; in fact, both methods have advantages and disadvantages in different contexts.
In the 4-point predictions, LSTM outperforms GAN-3 in all cases but one: N-FDE on the Naples dataset. The reason why LSTM is better for short predictions is that uncertainty grows over time, and multimodality is just aleatoric uncertainty. As the model tries to predict more time-steps, the number of possible behaviors grows, so the "average behavior" predicted by the LSTM becomes less accurate than the correct behavior identified by GAN-3. On the other hand, when the number of time-steps is low, the LSTM prediction is more accurate, since it minimizes the square error instead of just generating plausible trajectories as the GAN does.
In the 12-point predictions, LSTM still outperforms GAN-3 on the average errors (ADE and N-ADE). GAN-3 outperforms LSTM on the Rome dataset (FDE, N-FDE) and the Turin dataset (N-FDE). Again, the reason why GAN-3 performs better on final errors while LSTM performs better on average errors is that the cumulative uncertainty of the entire trajectory adds up at the final point. Since this only happens on the Rome and Turin datasets, we can deduce that these cities have a road graph with a highly multimodal configuration, allowing for more possible routes.
In Table 4, we show the improvement of the GAN-3 error over the GAN-1 error. The higher the ratio, the greater the improvement. A high ratio means that GAN-3 generates better solutions than GAN-1, and it indicates multimodality, as different trajectories are generated.
We can make three main observations on this table. First, the improvement is larger on long trajectories than on short ones, as expected. In fact, long trajectories carry more uncertainty, so there is more room for improvement with a multimodal generative approach.
Second, normalized metrics show a greater improvement than standard ones. This follows directly from how we defined the GAN-3 model, which measures the distance between two trajectories using the N-ADE and not the ADE. In this way, it prioritizes improvements on high-uncertainty trajectories, as uncertainty is correlated with length and variance.

Third, the best improvements are achieved on the Rome and Turin datasets. Those are the same datasets where GAN-3 outperforms LSTM, as we saw in Tables 2 and 3.
Multimodality can also be analyzed from the error growth. In Table 5, we show the N-FDE / N-ADE ratio of these experiments. A higher ratio means that the error grows faster, so the cumulative uncertainty over several time-steps is higher.
In most cases, as expected, this ratio is lower for the GAN-3 experiments. It means that, although the LSTM error is generally lower than the GAN error, it tends to grow faster. One possible interpretation of this phenomenon is that LSTM and GAN-3 have different types of error: while the GAN-3 error can be associated with a less accurate model (in general, GANs are harder to train than LSTM networks), the LSTM error is due to a divergence from the actual behavior as the uncertainty increases with the number of time-steps.
As we saw in Section 2, recent works on FCD trajectory prediction are based on LSTM models that minimize the Root Mean Square Error (RMSE). From our results, we can conclude that the LSTM model is better suited for unimodal scenarios (where choices are heavily constrained) and short-term predictions, while GAN-3 works best in scenarios with high uncertainty, especially when the number of time-steps grows larger.
In these cases, finding an "average behavior", as the LSTM does, is not a satisfying solution. In fact, the more accurate the model tries to be, the worse it performs, because the average behavior diverges from any actual behavior. Our work is the first that proposes a generative approach for FCD trajectory prediction, and we have shown that we can address the problem of multimodality by generating a set of multiple realistic predictions.

Traffic flow state analysis
As mentioned in the introduction, traffic flow states calculated from high-resolution trajectories are important and critical information for traffic control, and predicted trajectories can feed this kind of analysis. The space-time analysis has been performed on the 12-point predictions: the longer prediction is considered a better fit for the needs of a traffic control center managing traffic flows in urban contexts. The analyzed trajectories are those predicted with the LSTM network, because this model produces good results on all datasets, while GAN-3 outperforms LSTM only under particular conditions, so LSTM predictions are better suited for this analysis. Fig 5 displays the real positions plotted on top of the predicted ones, while Table 6 displays the distribution of planimetric distance values between real positions and the nearest predicted ones. This is a different kind of distance from the ones used to validate the predictions in Section 4.1, because this is a spatial analysis while the previous one was temporal. Unlike the previously proposed metrics, we do not use this type of distance to evaluate the models, because there is no direct correspondence between real and predicted values. Nevertheless, those value distributions allow us to make some observations:

• There is a clear correlation between distance values and the dimension of the considered city: the size of the Rome municipality is 1,287 km², compared to 182 km² of the Milan municipality, 130 km² of the Turin municipality, and 119 km² of the Naples municipality. On a wider area, predicted positions are potentially more dispersed, and this pattern is detectable in the statistics of the four cities.
• Despite being only third in terms of geographical extension, Turin has a standard deviation value much higher than Milan (almost 30% more extended) and Naples (only 10% less extended). This higher dispersion of the values can be correlated with the particular morphology of the municipality area: the E-SE part of the municipality lies in a hilly area, with a limited number of roads, very sparse patterns, and a low number of trajectories.
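The nearest-neighbour planimetric distances underlying these statistics can be sketched as follows (a simplified illustration with made-up coordinates; the actual analysis works on projected positions in metres):

```python
import numpy as np

def nearest_distance_stats(real, pred):
    """For each real position, distance to the nearest predicted position.
    A purely spatial comparison: no point-to-point correspondence is assumed.
    Both inputs have shape (n_points, 2)."""
    # Pairwise Euclidean distances, shape (n_real, n_pred)
    d = np.linalg.norm(real[:, None, :] - pred[None, :, :], axis=2)
    nearest = d.min(axis=1)
    return float(nearest.mean()), float(nearest.std())

real = np.array([[0.0, 0.0], [10.0, 0.0], [20.0, 5.0]])
pred = np.array([[1.0, 0.0], [10.0, 1.0], [18.0, 5.0]])
mean_d, std_d = nearest_distance_stats(real, pred)
```

The mean and standard deviation of these nearest distances are the kind of per-city summary that the planimetric-distance distribution compares, which is why a more dispersed prediction cloud inflates the standard deviation.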
Figs 6 to 9 display the results of the trend (top) and hot spot (bottom) analysis on the four cities: Milan, Naples, Rome, and Turin. Point aggregation has been made on hexagons with a height of 500 m. In all figures, the left map is obtained from the real values, while the right one is generated from the 12-point LSTM predictions.
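The effect of the aggregation step can be illustrated with a simplified stand-in: square cells instead of the hexagons actually used, and synthetic positions instead of FCD, but the same count-per-cell logic.

```python
import numpy as np
from collections import Counter

def aggregate_counts(points, cell_size=500.0):
    """Count positions per square grid cell of the given side length.
    The paper aggregates on hexagons with a 500 m height; square cells
    are a simpler stand-in to show how cell size drives occupancy."""
    return Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)

rng = np.random.default_rng(1)
points = rng.uniform(0, 5000, size=(1000, 2))  # synthetic positions in metres

coarse = aggregate_counts(points, 500.0)  # up to 100 possible cells
fine   = aggregate_counts(points, 100.0)  # up to 2,500 possible cells
```

With the same points, the finer grid spreads them over many more non-empty cells, each holding fewer positions, which anticipates why trend detection weakens at the 100 m aggregation discussed below.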
As far as the trend analysis is concerned (top of Figs 6 to 9), overall patterns show a low degree of similarity, meaning that V2I analysis is needed in future works. Furthermore, our predictions did not take into account information from the road graph. We can observe that predicted trajectories are not confined to existing infrastructure elements (i.e. road edges) and may therefore be placed in areas not dedicated to vehicles, consequently creating more disperse point clouds.
This aspect is highlighted both in Fig 5 and in Table 7: in three out of the four considered cities (Milan, Naples, and Rome), the total number of non-empty cells is lower for the analysis based on real values than for the one based on predicted values. In the case of Rome, the difference is significant, and this can be explained by the size of the city (and, consequently, of the dataset), almost one order of magnitude larger than the other three. Turin is the only city with a significant decrease in the total number of non-empty cells in the predicted scenario, and this again is partially due to its morphology.
The hot spot analysis (bottom of Figs 6 to 9) shows the highest levels of correlation between real and predicted positions, even if it too is affected by the non-confinement of the latter, generally leading to larger proportions of areas with some detected patterns.
This type of analysis is clearly conditioned by the size of the space-time cube cells: e.g., increasing the hexagon resolution (i.e. decreasing the hexagon height) to 100 m, the trend comparison shows narrower differences between real and predicted values in almost all trend classes (Table 8). This is mainly explained by the fact that, by reducing the size of the cells, the probability of having a relevant number of positions in each cell is lower, as is the possibility of detecting trends: comparing Tables 7 and 8, the percentage of cells with no significant trend is in the range of 52.26-83.92% for the 500 m aggregation and 89.31-98.00% for the 100 m aggregation. Consequently, the percentages of cells with some trend detected are extremely marginal when using the 100 m aggregation, and differences are narrower. Table 9 compares the overall accuracy over the 4 cities for both the 500 m and 100 m aggregation levels, confirming that the smaller aggregation size provides significantly higher correspondence in the trend analysis.

Conclusion and future works
In this work, we addressed challenges and aspects of the FCD trajectory prediction problem that have been neglected by other works in this field. We proposed new datasets with a large number of data points and multimodal scenarios, as well as new aggregate metrics that reduce bias by normalizing ADE and FDE to critical properties of the trajectories.
We addressed the problem of multimodality in a way that is missing in the current state of the art, proposing the generative model GAN-3. Although the standard LSTM model is superior in unimodal scenarios, most real-world ones are multimodal and require a different approach. Our GAN-3 generates different plausible solutions instead of a single average one, and it outperforms LSTM in those cases where uncertainty is high. GAN-3 doesn't replace LSTM, as the latter is still superior in unimodal scenarios, but each method is best suited for different contexts.
Since this is the first work to use generative methods in FCD trajectory prediction, there are many possible directions for future research in the field. One possible improvement is the analysis of longer trajectories. This would be impossible in multimodal scenarios with a standard LSTM, but generative models should be able to tackle this challenge, enumerating different possible behaviors in a long time frame.
Another possible improvement is the study of the V2V interaction that we mentioned among the challenges of FCD trajectory prediction. Different vehicles influence each other's trajectories in a major way; taking this aspect into account would largely reduce uncertainty (and the problems that arise from multimodality) and improve accuracy. Another class of improvements would be the study of V2I interaction, making use of additional information such as the road network and related traffic restrictions, which heavily constrain the possible paths a vehicle can follow. This type of information has been ignored in this work, mainly for the purpose of comparing LSTM and GAN-3 in a purely data-driven fashion. Nevertheless, a hybrid approach can bring several advantages in practical scenarios.
Furthermore, we only evaluated the models on the datasets they have been trained on; future works could include experiments that train and evaluate models on different datasets. If different datasets yield very different results, it means that predictions depend more on the environment than on general properties of vehicle trajectories, and that would require further research on V2I interaction.
The penetration rate of CAVs should be adequately taken into consideration, because it can highly influence the performance of UTCs based on those data: Ajmar et al. [13] provided some insights comparing floating car data (FCD) with low penetration rates against data acquired by fixed sensors, but more research on the impact of low penetration rates is still required. Nevertheless, CAV penetration rates and smart road development are expected to grow systematically in the next years, and the methodology proposed here can be easily scaled, implicitly benefiting in accuracy from the availability of more complete data.
Supporting information

S1 File. (ZIP)

Table 7. Variation in % of the cells in each trend class in the 4 cities computed using the 500 m aggregation: positive values are classes where real values have more occurrences than predicted ones. In brackets, the real and the predicted percentage values; for the total number of cells with at least one value, the real and the predicted total numbers are reported in brackets.