Abstract
Pedestrian trajectory prediction is crucial for autonomous vehicles, yet it faces challenges in integrating complex spatiotemporal dynamics, managing multi-modal future behaviors, and ensuring real-time performance. This paper introduces the Local-Global Collaborative Transformer Network (LGCMT) to address these issues. LGCMT features an innovative local-global collaborative encoder comprising two key modules: a Sparse Causal Temporal Attention (SCT-MSA) module, designed to extract fine-grained local causal dynamics, and a Global Context Encoder that utilizes Cosine Similarity Attention to capture macro-level spatiotemporal patterns. For multi-modal prediction, LGCMT employs a parallel Non-Autoregressive (NAR) decoder guided by a motion pattern library, which efficiently generates diverse trajectory candidates covering key future likelihoods. Extensive evaluations on the standard ETH/UCY benchmarks and the large-scale Stanford Drone Dataset (SDD) demonstrate LGCMT’s robust performance. On ETH/UCY, the model improves ADE and FDE by approximately 4.8% and 5.6% compared to the competitive TUTR baseline. Moreover, the proposed framework achieves exceptional inference efficiency, establishing LGCMT as a potent solution that effectively balances accuracy, multi-modality, and operational speed for real-time applications.
Citation: Gong S, Bao Y, Hou Y, Lu W, Shi Q (2026) Local causal dynamic integrated global mode guidance transformer network for pedestrian trajectory prediction. PLoS One 21(4): e0347049. https://doi.org/10.1371/journal.pone.0347049
Editor: Farman Ullah, UAEU: United Arab Emirates University, UNITED ARAB EMIRATES
Received: July 14, 2025; Accepted: March 26, 2026; Published: April 20, 2026
Copyright: © 2026 Gong et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant results are reported within the paper. The author-generated code underlying the findings, together with selected pre-trained model weights and documentation for installation and use, is openly available on GitHub at https://github.com/NTU24pg/LGCMT under the MIT License. The pre-processed datasets used in this study, including ETH, UCY, and SDD, are publicly available on Zenodo (DOI: https://doi.org/10.5281/zenodo.15691159) to ensure long-term accessibility. The original ETH, UCY, and SDD benchmark datasets are publicly available from their respective original sources.
Funding: This work was supported by the National Natural Science Foundation of China (Grant 62476145); the Humanity and Social Science Foundation of Ministry of Education of China (Grant 24YJAZH126); the 6th ‘333 Talents’ Technology Research and Development Talent Foundation of Jiangsu Province; the Transportation Technology and Achievement Transformation Foundation of Jiangsu Province (Grant 2024G01); the Key Laboratory of Target Cognition and Application Technology (Grant 2023-CXPTLC-005); and the Postgraduate Research & Practice Innovation Program of Jiangsu Province (Grant SJCX25_2019). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. No authors received a salary from any of the funders.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Accurate and efficient pedestrian trajectory prediction is essential for safe autonomous navigation [1,2] and effective human-robot interaction [3]. In dynamic environments, decision-making systems must anticipate future states by leveraging historical motion patterns and social cues [4]. Although deep learning has substantially advanced predictive performance, real-world deployment still hinges on addressing a key trade-off: achieving strong representational capacity to capture complex spatiotemporal dynamics while maintaining inference efficiency to satisfy real-time latency requirements.
Modeling pedestrian motion faces two structural challenges regarding feature representation and output generation. First, pedestrian movement is governed by two distinct temporal scales: local causal dynamics, representing immediate kinematic reactions to surroundings, and global motion trends, representing consistent long-term directionality. Existing architectures often struggle to balance these. For instance, Transformer-based models like AgentFormer [5] utilize dense self-attention across the full sequence, which entangles local and global cues within a unified attention map rather than explicitly separating them. Conversely, methods like STAR [6] process spatial and temporal dimensions via separate stages. While this design is structured, it lacks an explicit mechanism to disentangle multi-scale temporal dynamics within its processing pipeline.
Second, the inherent multimodality of human behavior—where a single history can lead to multiple plausible futures—demands modeling a distribution of possible trajectories. As illustrated in Fig 1, social interactions create diverse plausible paths from the same observed history. However, generating these hypotheses efficiently remains difficult. Generative approaches based on diffusion models [7] or GANs [8] often incur high computational overhead due to iterative denoising or complex sampling. Among Transformer-based methods, high-fidelity approaches frequently rely on autoregressive (AR) decoding [5,9], which requires sequential forward passes proportional to the prediction horizon Tpred. This linear scaling of latency with forecast length poses a significant challenge for safety-critical, real-time applications.
Given the same observed history, different interaction outcomes can lead to multiple plausible future trajectories (red dashed), while only one is realized as the ground truth (blue). This illustration motivates the need to predict multiple modes to capture the eventual outcome.
To address these limitations, we propose the LGCMT, a framework designed to balance structural efficiency with predictive diversity. LGCMT introduces a dual-branch encoder to explicitly model the hierarchy of motion: a SCT-MSA branch captures short-term, history-dependent kinematic adjustments, while a Cosine Similarity Attention branch extracts macro-level directional trends. To ensure scalability in crowded scenes, we incorporate a distance-based spatial filtering strategy that reduces interaction complexity from quadratic to linear.
For decoding, we replace the standard autoregressive loop with a Library-Guided Non-Autoregressive (NAR) decoder. By retrieving structured motion patterns from a library learned on training trajectories, the model generates a diverse set of candidates in a single decoding pass, reducing the number of forward passes from Tpred to 1.
Our specific contributions are:
- Local-Global Collaborative Encoder: We propose a specialized encoder that disentangles local causal dynamics from global motion trends using parallel sparse-causal and cosine-similarity attention mechanisms. This design explicitly separates immediate reactions from long-term directionality at the attention level.
- Library-Guided NAR Decoding: We introduce a parallel prediction framework that utilizes a learned motion pattern library to guide non-autoregressive generation. This approach eliminates the latency of sequential decoding and avoids the iterative overhead associated with diffusion-style models.
- Efficiency and Accuracy Balance: Validated on ETH/UCY and the large-scale SDD benchmarks, LGCMT achieves competitive accuracy compared to recent baselines. It demonstrates exceptional inference speed (approximately 2.8 ms per sample), making it highly suitable for real-time deployment in complex environments.
2 Related work
This section reviews existing literature focusing on two critical dimensions: spatiotemporal representation learning and the trade-off between multimodality and inference efficiency.
2.1 Spatiotemporal representation and interaction modeling
Early data-driven approaches utilized Recurrent Neural Networks (RNNs) to model sequential dependencies. Social-LSTM [10] introduced social pooling to aggregate neighbor information, a concept refined by subsequent attention-based RNNs [11,12]. While Transformers have gained traction, RNN architectures continue to evolve; for instance, the recent AFC-RNN [13] incorporates adaptive forgetting controllers to explicitly manage historical redundancy, demonstrating the continued relevance of recurrent structures for temporal modeling.
Graph Neural Networks (GNNs) [14] offer a flexible topology for modeling interactions, treating agents as nodes. Approaches like SGCN [15] and Social-STGCNN [16] leverage sparse graph convolutions to capture social effects. However, GNNs typically prioritize spatial topology, often employing simpler temporal aggregation mechanisms compared to attention-based sequence models.
The Transformer architecture [17] addresses long-range dependencies via self-attention. Early adaptations like STAR [6] interleaved spatial graphs with temporal Transformers. More recent works focus on enhancing robustness via specific constraints or additional modules. For example, TP-EGT [18] introduces a collision-aware Graph Transformer within a multi-task framework to explicitly predict interaction probabilities. Similarly, APT-TP [19] utilizes semantic maps and inverse reinforcement learning to enforce fine-grained trajectory-scene consistency. These constraint-aware Transformers primarily add supervision signals, modules, or extra inputs on top of Transformer backbones, instead of explicitly separating local causal dynamics and global trends at the attention level. In contrast, our LGCMT focuses on the intrinsic efficiency of the attention structure itself, employing sparse causal attention to strictly model the arrow of time for local dynamics, distinct from global trend analysis.
2.2 Multimodality and inference efficiency
Pedestrian trajectory prediction is inherently multimodal. Generative Models handle this by learning latent distributions. GANs [3,8] and CVAEs [20,21] sample from latent noise to generate diverse paths. Recently, Diffusion Models [7,22] have achieved high fidelity in distribution modeling but typically require multiple reverse steps for denoising, increasing inference cost. Normalizing Flows [23] offer exact likelihood estimation but often involve complex invertible transformations.
Deterministic Multi-Hypothesis approaches offer an alternative. Many Transformer-based models, such as AgentFormer [5], employ Autoregressive (AR) decoding to generate multimodal distributions. While AR ensures temporal coherence, it requires Tpred sequential forward passes, creating a latency bottleneck. Non-Autoregressive (NAR) approaches, such as query-based set prediction methods like TUTR [24], attempt to generate all time steps simultaneously. LGCMT extends this NAR paradigm by conditioning predictions on a discrete, learned motion pattern library. This structured guidance aims to ensure trajectory plausibility without the computational overhead of iterative sampling or sequential decoding loops.
3 Materials and methods
This section elaborates on the LGCMT, a model developed to address key challenges in pedestrian trajectory prediction. The proposed framework integrates specialized encoding, multi-mode hypothesis generation, and parallel decoding mechanisms. Subsequent subsections detail the problem formulation, the overall architecture, the design of each core component, and the training methodology.
3.1 Problem formulation and model overview
The objective of pedestrian trajectory prediction is defined as follows. Given the observed historical position sequence for a target pedestrian i over the past Tobs time steps, denoted as $X_i = \{p_i^t\}_{t=1}^{T_{obs}}$ with $p_i^t \in \mathbb{R}^2$, and the historical trajectories Xj of neighboring pedestrians $j \in \mathcal{N}_i$, the goal is to forecast the target's future motion. To efficiently handle crowded scenes and filter out irrelevant interactions, we explicitly define the neighbor set $\mathcal{N}_i$ based on a fixed spatial radius rather than considering all pedestrians in the scene. Specifically, a pedestrian j is included in $\mathcal{N}_i$ if and only if their distance to the target i is within a threshold $d_{th}$:

$$\mathcal{N}_i = \{\, j \in \mathcal{P} \setminus \{i\} : \lVert p_j^{T_{obs}} - p_i^{T_{obs}} \rVert_2 \le d_{th} \,\}$$

where $\mathcal{P}$ denotes the set of all pedestrians in the scene. This distance-based filtering strategy effectively reduces the computational complexity from quadratic in the crowd size to linear in the number of relevant neighbors, preventing noise from distant agents and ensuring scalability in dense environments.

The task is to predict a set of K plausible future trajectories for pedestrian i over the next Tpred time steps. This output set is represented as $\mathcal{Y}_i = \{Y_i^{(1)}, \ldots, Y_i^{(K)}\}$, where each individual trajectory hypothesis $Y_i^{(m)} = \{\hat{p}_i^t\}_{t=T_{obs}+1}^{T_{obs}+T_{pred}}$ signifies a distinct future path. This formulation inherently accommodates the multi-mode nature of pedestrian movement.
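The distance-based neighbor filtering above can be sketched in a few lines. This is a minimal numpy illustration; the function name and the threshold value are ours, not from the paper.

```python
import numpy as np

def neighbor_set(positions, target_idx, d_th):
    """Return indices of pedestrians within d_th of the target
    at the last observed time step (distance-based filtering)."""
    target = positions[target_idx]
    dists = np.linalg.norm(positions - target, axis=1)
    mask = dists <= d_th
    mask[target_idx] = False  # exclude the target itself
    return np.flatnonzero(mask)

# Last observed 2D positions of four pedestrians; target is index 0.
pos = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.0, 2.0]])
print(neighbor_set(pos, 0, d_th=3.0))  # -> [1 3]
```

The cost of this step is linear in the number of pedestrians, and the resulting index set bounds all later social-attention computations.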
Fig 2 illustrates the LGCMT architecture. Input 2D coordinates for the target pedestrian Xi and its neighbors are first embedded into high-dimensional features. The target’s features are then processed by our local-global collaborative encoder (described in the Pedestrian history encoding: A local-global collaborative approach subsection). This encoder has two parallel branches: a Causal Temporal Encoder (CTE) with SCT-MSA to capture local dynamics, and a Global Context Encoder (GCE) with cosine similarity attention for global patterns. Their fused output, $H_i$, represents the target’s history. Correspondingly, the historical trajectories of neighbors Xj are processed through a dedicated embedding layer to obtain their feature representations Hj (see the Socially-aware parallel trajectory decoding subsection for details). Next, a pattern scoring module (CLS Head, as detailed in the Structured multi-mode hypothesis generation via motion pattern library subsection) compares $H_i$ against a pre-constructed motion pattern library $\mathcal{M}$, selecting the top-K motion patterns and their embeddings $\{E_m\}_{m=1}^{K}$. Finally, for each selected pattern, a socially-aware non-autoregressive decoding process (detailed in the Socially-aware parallel trajectory decoding subsection) is initiated. It leverages the target’s mode-specific feature representation, incorporates social context derived from neighbor features Hj via an attention mechanism, and then utilizes a regression head (REG Head) to generate the complete future trajectory Y(i,m) in a single step, outputting K trajectory candidates $\{Y^{(i,m)}\}_{m=1}^{K}$.
The target trajectory Xi and neighbor trajectories are embedded and processed by a local–global collaborative encoder, consisting of a Causal Temporal Encoder (CTE) with SCT-MSA for local dynamics and a Global Context Encoder (GCE) for global motion trends. A motion-pattern library is scored by the CLS head to select top-K modes, and a socially-aware non-autoregressive decoder (REG head) generates K future trajectory hypotheses in parallel.
3.2 Pedestrian history encoding: A local-global collaborative approach
Accurately predicting future pedestrian trajectories hinges on effectively understanding their historical motion. Pedestrian movement is not random; it often comprises a blend of immediate, fine-grained maneuvers and broader, macro-level intentions. To capture this inherent duality, we introduce a local-global collaborative encoder. This encoder processes the observed trajectory $X_i = \{p_i^t\}_{t=1}^{T_{obs}}$, where $p_i^t \in \mathbb{R}^2$ represents the 2D coordinates of pedestrian i at time t, and Tobs is the length of the observation period.
The encoding process starts by considering each of the Nlib pre-defined motion patterns from the library (see the Structured multi-mode hypothesis generation via motion pattern library subsection). For each pattern Mk, a complete candidate sequence Sk is formed by concatenating the observed history Xi with the pattern coordinates Mk, spanning the combined observation and prediction horizon (Tobs + Tpred time steps). These Nlib candidate sequences are then processed in parallel through two distinct initial embedding pathways, corresponding to the CTE and GCE branches, as illustrated in Fig 2.

For the CTE path, designed to capture local dynamics, each candidate sequence Sk is processed by a Temporal Embedding module. This module transforms each 2D coordinate pt within Sk into a dmodel-dimensional feature vector $e_t^{(k)}$:

$$e_t^{(k)} = \phi_{temp}(p_t), \quad e_t^{(k)} \in \mathbb{R}^{d_{model}}$$

This yields Nlib initial embedded sequences $E^{(k)} = \{e_t^{(k)}\}_{t=1}^{T_{obs}+T_{pred}}$, which form the input tensor for the CTE branch.

Concurrently, for the GCE path aiming to capture global context, each candidate sequence Sk is processed by a Sequence Embedding module. This module takes the entire sequence Sk as input and generates a single dmodel-dimensional feature vector representing the overall context of that specific mode hypothesis:

$$g^{(k)} = \phi_{seq}(S_k), \quad g^{(k)} \in \mathbb{R}^{d_{model}}$$

The collection of these Nlib vectors, forming a tensor $G \in \mathbb{R}^{N_{lib} \times d_{model}}$, serves as the input representation for the GCE branch. These distinct embedding strategies ensure that the subsequent CTE and GCE layers receive input features tailored to their respective tasks of local and global pattern extraction.
3.2.1 Causal temporal encoder (CTE) for efficient local dynamic extraction.
The first branch, our CTE, is designed to meticulously capture the fine-grained temporal dynamics from the recent history of a pedestrian’s movement. The cornerstone of the CTE is the SCT-MSA module, illustrated in Fig 3. The design of SCT-MSA intrinsically respects the natural arrow of time in motion by being causal; that is, the feature representation at any time step t is influenced exclusively by past and present information ($t' \le t$). Furthermore, SCT-MSA introduces sparsity by confining its attention mechanism to a defined local historical window of size Rwindow. For each time step t, it considers information only from $t' \in [t - R_{window},\, t]$. This focus on localized, recent history is crucial for capturing immediate motion cues. The combination of causality and local sparse attention significantly enhances computational efficiency, reducing the self-attention complexity from $O(T^2 \cdot d_{model})$ per layer, typical of standard Transformers [17], to a more favorable $O(T \cdot R_{window} \cdot d_{model})$. This makes the CTE well-suited for processing observation sequences where local dependencies are crucial, particularly with large Tobs and small Rwindow.
The mechanism restricts attention to a local sliding window of size Rwindow (shaded grey area), ensuring that the feature representation at time t depends only on the recent history [t − Rwindow, t]. This design enforces causality and reduces computational complexity compared to full self-attention.
Within each SCT-MSA layer, the input sequence (e.g., the temporal embeddings for the initial layer) is transformed into Query (Q), Key (K), and Value (V) vectors via distinct linear projections ($W_Q$, $W_K$, $W_V$):

$$Q = h_{in} W_Q, \quad K = h_{in} W_K, \quad V = h_{in} W_V$$

where hin denotes the feature input to the current layer. Attention scores are then computed using scaled dot-product attention, where a mask, $M$, rigorously enforces both causality and the sparse local window by permitting attention only to positions $t'$ within the allowed range:

$$A_{t,t'} = \frac{Q_t K_{t'}^{\top}}{\sqrt{d_k}} + M_{t,t'}, \quad M_{t,t'} = \begin{cases} 0 & \text{if } t - R_{window} \le t' \le t \\ -\infty & \text{otherwise} \end{cases}$$

Here, $d_k = d_{model} / N_h$ is the dimension per attention head (Nh heads total). After Softmax normalization, the output feature $o_t$ is a weighted sum of Value vectors from the defined causal sparse window:

$$o_t = \sum_{t' = t - R_{window}}^{t} \mathrm{Softmax}(A_t)_{t'} \, V_{t'}$$

The SCT-MSA module is then completed with residual connections, Layer Normalization, and a position-wise Feed-Forward Network (FFN), adhering to the standard Transformer block structure [17]. Stacking LCTE such layers empowers the CTE branch to produce $H_{CTE}$, a feature sequence rich in detailed, short-term motion characteristics.
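As an illustration, the sparse causal masking can be sketched with numpy. This is a single-head sketch with hypothetical names; the actual module uses learned multi-head projections, residual connections, and Layer Normalization.

```python
import numpy as np

def sct_mask(T, R):
    """Sparse causal mask: position t may attend only to t' in [t-R, t].
    Allowed entries are 0; disallowed entries are -inf (added pre-softmax)."""
    t = np.arange(T)[:, None]
    tp = np.arange(T)[None, :]
    allowed = (tp <= t) & (tp >= t - R)
    return np.where(allowed, 0.0, -np.inf)

def sct_attention(Q, K, V, R):
    """Single-head scaled dot-product attention with the sparse causal mask."""
    T, dk = Q.shape
    scores = Q @ K.T / np.sqrt(dk) + sct_mask(T, R)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)   # softmax; masked entries become 0
    return w @ V
```

Because each row of the mask admits at most R + 1 positions, a banded implementation of these scores attains the O(T · Rwindow · dmodel) cost stated above (the dense sketch here is for clarity only).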
3.2.2 Global context encoder (GCE) for macro-level patterns.
The second branch of our encoder, the GCE, focuses on discerning broader, macro-level patterns inherent to the target pedestrian’s own movement. It is crucial to clarify that “Global” in this context refers to the temporal global scope of the trajectory sequence, rather than the spatial global scope of the crowd.
Unlike the Social Decoder which handles agent-agent interactions, the GCE is strictly an intra-agent module. It processes the entire candidate sequence (comprising the observed history and a hypothesized future motion mode) as a single input. This holistic view allows the GCE to extract overarching behavioral trends specific to the target’s individual motion intent, independent of social interactions. By isolating the individual’s long-term goal from transient social perturbations, the GCE provides a stable representation of long-term intent.
The GCE consists of LGCE Transformer encoder layers [17], distinguished by its use of Cosine Similarity Attention. Query (Qt) and Key (Kt′) vectors are generated as in the CTE. Their similarity, however, is measured using cosine similarity, which emphasizes their directional relationship rather than dot-product magnitude, potentially offering a better grasp of overall motion intent:

$$\mathrm{sim}(Q_t, K_{t'}) = \frac{Q_t \cdot K_{t'}}{\lVert Q_t \rVert \, \lVert K_{t'} \rVert}$$

Attention scores are derived by scaling this similarity with a learnable parameter $\tau$ (an optional mask is typically unused to allow full global interaction):

$$A_{t,t'} = \tau \cdot \mathrm{sim}(Q_t, K_{t'})$$

Following Softmax normalization:

$$\alpha_{t,t'} = \frac{\exp(A_{t,t'})}{\sum_{t''} \exp(A_{t,t''})}$$

The output feature $o_t$ aggregates information from all historical time steps:

$$o_t = \sum_{t'} \alpha_{t,t'} \, V_{t'}$$

Stacking LGCE such modified Transformer layers, each incorporating standard Layer Normalization and an FFN, yields $H_{GCE}$, a sequence encoding the global contextual information of the trajectory.
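A minimal single-head sketch of the cosine-similarity attention follows. In the model the scaling parameter tau is learnable; here it is a fixed illustrative constant, and the helper name is ours.

```python
import numpy as np

def cosine_attention(Q, K, V, tau=10.0):
    """Attention whose logits are tau * cosine(Q_t, K_t'); no mask is applied,
    so every position attends to the full sequence (global context)."""
    Qn = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    Kn = K / np.linalg.norm(K, axis=-1, keepdims=True)
    logits = tau * (Qn @ Kn.T)           # bounded in [-tau, tau]
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V
```

Normalizing Q and K bounds the logits, so the learnable tau controls attention sharpness explicitly instead of leaving it to unbounded dot-product magnitudes.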
3.2.3 Feature fusion for comprehensive understanding and computational considerations.
Having extracted detailed local dynamics ($H_{CTE}$) with the CTE and broad global patterns ($H_{GCE}$) with the GCE, these complementary representations are integrated to form a holistic understanding of the pedestrian’s historical movement. This is achieved through an element-wise summation:

$$H_i = H_{CTE} + H_{GCE}$$

This fused representation $H_i$ serves as the comprehensive historical representation for subsequent model stages.

This dual-branch architecture reflects a deliberate design choice regarding computational complexity. The CTE, with its SCT-MSA module, reduces per-layer attention complexity to approximately $O(T \cdot R_{window} \cdot d_{model})$ from the standard $O(T^2 \cdot d_{model})$. Its overall complexity (including FFNs of hidden dimension dff) is roughly $O(L_{CTE} \cdot T \cdot (R_{window} d_{model} + d_{model} d_{ff}))$. This renders the CTE highly efficient for long sequences where $R_{window} \ll T$. Conversely, the GCE maintains $O(T^2 \cdot d_{model})$ per-layer attention complexity to capture all-pairs global context, leading to an overall GCE complexity of approximately $O(L_{GCE} \cdot T \cdot (T d_{model} + d_{model} d_{ff}))$. The fusion step adds negligible $O(T \cdot d_{model})$ complexity. This design allows our model to balance computational load: the CTE efficiently distills local causal dynamics with complexity linear in Tobs (for fixed Rwindow), while the GCE, though more intensive, extracts indispensable global context, achieving a synergistic blend of expressive power and efficiency.
3.3 Structured multi-mode hypothesis generation via motion pattern library
Pedestrian behavior is inherently multi-mode, meaning individuals often have several plausible future paths. To effectively address this diversity while avoiding the inefficiencies and complex post-processing associated with some traditional generative models, we introduce a structured hypothesis generation approach. This method is centered on a pre-constructed motion pattern library, a collection denoted as $\mathcal{M} = \{M_k\}_{k=1}^{N_{lib}}$. This library comprises Nlib distinct, typical future motion patterns, each Mk representing a sequence of 2D coordinates over the prediction horizon Tpred.

The creation of this motion pattern library is an offline process performed once using the training dataset. Initially, a large corpus of future trajectory segments, each spanning Tpred time steps, is collected. To ensure that the learned patterns represent general motion characteristics rather than absolute starting positions, each trajectory segment is normalized, for instance, by translating its initial point to the origin. Following normalization, K-Means clustering, a standard unsupervised learning algorithm, is applied to these trajectory segments. K-Means groups similar trajectories together, and the centroid of each resulting cluster becomes a distinct motion pattern Mk in our library $\mathcal{M}$. The number of patterns, Nlib, is a hyperparameter chosen based on dataset characteristics and desired granularity. Each raw 2D coordinate sequence Mk is then transformed into a learnable dmodel-dimensional feature vector, $E_k \in \mathbb{R}^{d_{model}}$, using an embedding layer denoted $\phi_{lib}$. This allows the model to work with richer representations of these patterns.
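The offline library construction can be sketched as follows, using a minimal hand-rolled K-Means in numpy. In practice a standard clustering implementation would be used; the function name, iteration count, and initialization are illustrative.

```python
import numpy as np

def build_pattern_library(segments, n_lib, iters=50, seed=0):
    """Cluster normalized future segments of shape (N, T_pred, 2) with a
    minimal K-Means; the cluster centroids become the motion patterns M_k."""
    N, T, _ = segments.shape
    segs = segments - segments[:, :1, :]          # translate start to origin
    X = segs.reshape(N, T * 2)                    # flatten for clustering
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(N, n_lib, replace=False)]
    for _ in range(iters):
        # assign each segment to its nearest centroid, then recompute means
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        for k in range(n_lib):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids.reshape(n_lib, T, 2)         # the library patterns M_k
```

Because every normalized segment starts at the origin, each centroid does too, so the patterns encode motion shape and direction rather than absolute position.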
During inference, the model utilizes the fused feature tensor $H$ obtained after the local-global feature fusion step (see the Feature fusion for comprehensive understanding and computational considerations subsubsection). This tensor, with dimensions reflecting the batch size, the Nlib mode hypotheses, and the feature dimension (B × Nlib × dmodel), encapsulates the comprehensive representation for each potential future scenario. This entire tensor $H$ is then directly fed into the scoring network, MLPscore (referred to as the CLS Head in Fig 2). The scoring network, implemented as a linear layer mapping from dmodel to 1, operates independently on the feature vector corresponding to each of the Nlib mode hypotheses:

$$\mathrm{Scores}_i = \mathrm{MLP}_{score}(H)$$

where Scoresi is a tensor of shape B × Nlib, containing the calculated score for each of the Nlib patterns based on their respective fused representations.

These scores are converted into a probability distribution pi over the patterns via the Softmax function, where p(i,k) indicates the likelihood of pattern Mk for pedestrian i:

$$p_{(i,k)} = \frac{\exp(\mathrm{Scores}_{(i,k)})}{\sum_{k'=1}^{N_{lib}} \exp(\mathrm{Scores}_{(i,k')})}$$

For training, target modes and probabilities guide the learning of Lpred and Lmode (detailed in the Training strategy and loss functions subsection). In inference, the top-K patterns, identified by $\mathrm{TopK}(p_i)$, and their embeddings $\{E_m\}$ direct the parallel generation of multiple trajectory hypotheses.
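The CLS-head scoring and top-K selection reduce to a linear map plus Softmax, sketched here with numpy for a single pedestrian (weight and function names are illustrative):

```python
import numpy as np

def score_and_select(H, w, b, k):
    """H: (N_lib, d_model) fused features, one per mode hypothesis.
    A linear CLS head maps each to a scalar score; Softmax yields the
    mode distribution p_i, and the top-k indices select the patterns."""
    scores = H @ w + b                        # (N_lib,) one score per mode
    e = np.exp(scores - scores.max())         # numerically stable softmax
    p = e / e.sum()
    topk = np.argsort(p)[::-1][:k]            # indices of the k likeliest modes
    return p, topk
```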
This library-based mechanism enhances prediction quality and interpretability by incorporating explicit prior knowledge of common behaviors, guiding generation towards plausible outcomes. While the library’s coverage limits its ability to represent entirely novel behaviors and discretizing motion might lose some nuances, its careful construction is expected to significantly improve the generation of diverse and realistic trajectory candidates. The quality and representativeness of the library are key considerations.
3.4 Socially-aware parallel trajectory decoding
Effective trajectory prediction requires not only understanding an individual’s past movement and intended goals but also their dynamic interactions with surrounding individuals. The Socially-Aware Parallel Trajectory Decoding approach proposed in this paper addresses this by simultaneously generating multiple future path hypotheses for a target pedestrian, each explicitly conditioned on the dynamic social context. This capability is vital for creating realistic and reliable predictions, particularly in scenarios with complex pedestrian interplay where movements are heavily interdependent.
To incorporate these social influences efficiently, the decoder utilizes the neighbor set $\mathcal{N}_i$ identified via the spatial distance threshold $d_{th}$ (as defined in Section 3.1). This design is critical for scalability. Let P denote the total number of pedestrians in the scene and Nmax be the maximum number of neighbors considered (set to 50 in our experiments). While global attention mechanisms inherently suffer from quadratic complexity $O(P^2)$, our distance-based filtering reduces the interaction scope to a local subset. Consequently, the computational cost for the social module scales linearly, $O(P \cdot N_{max})$. This ensures that the model remains lightweight and responsive even in high-density crowds. The historical trajectory Xj of each neighbor $j \in \mathcal{N}_i$ is processed to obtain its summarized context vector cj. In our proposed LGCMT model, this is achieved by first flattening its historical trajectory Xj into a single vector, which is then processed through a dedicated linear embedding layer. This approach provides computationally efficient contextual representations cj for each neighboring agent, suitable for the subsequent social attention module.
For each of the K motion patterns selected from the library (as described in the Structured multi-mode hypothesis generation via motion pattern library subsection), the decoding process generates a corresponding future trajectory Y(i,m). This generation process is non-autoregressive, predicting all Tpred future coordinates simultaneously for enhanced inference speed. The process begins with a feature representation for the target pedestrian i that is specific to the selected mode m. This feature, denoted $h_i^{(m)}$, is derived from the target’s fused historical features $H_i$ and incorporates the corresponding pattern embedding $E_m$.

Crucially, social context is then integrated in a mode-specific manner using a social attention mechanism. This mechanism takes the target pedestrian’s mode-specific feature $h_i^{(m)}$ as the query, while the set of neighbor context vectors $\{c_j\}_{j \in \mathcal{N}_i}$ serve as keys and values. This allows the model to dynamically weigh the influence of each surrounding individual conditioned specifically on the motion hypothesis m being considered:

$$\tilde{h}_i^{(m)} = \mathrm{Attention}\big(h_i^{(m)},\, \{c_j\},\, \{c_j\}\big)$$

This yields a socially-informed, mode-specific feature vector, $\tilde{h}_i^{(m)}$, which encapsulates the target’s history, the specific motion pattern’s influence, and relevant social interactions pertinent to that mode.

This resultant feature is then fed directly into a shared regression network, MLPreg. This network outputs the complete Tpred-step future trajectory Y(i,m) for the corresponding mode m:

$$Y^{(i,m)} = \mathrm{MLP}_{reg}\big(\tilde{h}_i^{(m)}\big)$$
This entire procedure, from preparing the mode-specific query to final regression, is executed in parallel for each of the K selected modes, efficiently producing the diverse set of trajectory candidates.
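A compressed numpy sketch of one mode's decoding pass follows, combining the social cross-attention with the single-shot regression head. Single-head attention, a linear regression head, and an additive fusion of the social context are simplifications; all names are ours.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def nar_decode(h_mode, C, W_reg, b_reg, T_pred):
    """h_mode: (d,) mode-specific target feature (history + pattern embedding).
    C: (N_neighbors, d) neighbor context vectors, used as keys and values.
    One cross-attention step, then a regression head emits all T_pred
    future coordinates in a single forward pass (non-autoregressive)."""
    d = h_mode.shape[0]
    attn = softmax(C @ h_mode / np.sqrt(d))   # weight each neighbor
    social = attn @ C                         # (d,) aggregated social context
    h_tilde = h_mode + social                 # socially-informed feature
    out = h_tilde @ W_reg + b_reg             # (T_pred * 2,) all steps at once
    return out.reshape(T_pred, 2)
```

Because the regression head emits the whole horizon in one matrix multiplication, running this for the K selected modes is K independent, parallelizable passes rather than Tpred sequential ones.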
This parallel and non-autoregressive design efficiently produces a diverse set of trajectory candidates. By bypassing the iterative recurrence of traditional LSTM-based decoders (which require Tpred sequential steps), our regression head generates the full prediction horizon in a single forward pass ($O(1)$ temporal complexity). Combined with the linear spatial complexity of the social attention, this architecture achieves a significant reduction in inference latency. The explicit modeling of social interactions makes this decoding strategy particularly effective in complex real-world environments, while the optimized computational design ensures adaptability across varying crowd densities.
Algorithm 1: LGCMT Prediction Process
3.5 Training strategy and loss functions
The model is trained by minimizing a composite loss function, Ltotal, designed to ensure both prediction accuracy and appropriate mode selection. This total loss is a weighted sum of a trajectory prediction loss Lpred and a mode classification loss Lmode, balanced by hyperparameters $\lambda_{pred}$ and $\lambda_{mode}$:

$$L_{total} = \lambda_{pred} L_{pred} + \lambda_{mode} L_{mode}$$
The trajectory prediction loss, Lpred, is formulated to guide the model towards accurate trajectory generation conditioned on the most suitable motion pattern during the training phase. Specifically, for each ground truth future trajectory $Y_i^{gt}$, we first identify the index m* of the motion pattern $M_{m^*}$ within the library $\mathcal{M}$ that exhibits the highest similarity to $Y_i^{gt}$. The model is then explicitly trained to produce a single trajectory prediction, $Y^{(i,m^*)}$, by utilizing only the computational path associated with this best-matching mode m*. The prediction loss Lpred is subsequently computed as the Smooth L1 loss between this specific prediction $Y^{(i,m^*)}$ and the ground truth $Y_i^{gt}$:

$$L_{pred} = \ell_{SmoothL1}\big(Y^{(i,m^*)},\, Y_i^{gt}\big)$$

where $\ell_{SmoothL1}$ represents the Smooth L1 loss function, typically averaged over all predicted time steps and samples. This training approach ensures that the learning signal focuses on generating accurate predictions from the identified ‘target’ mode, differing fundamentally from the inference-time evaluation procedure where the best among keval generated hypotheses is selected based on distance metrics.
The mode classification loss, Lmode, guides the model to identify appropriate underlying motion patterns from the library. A target “soft” probability distribution qi over the Nlib library patterns is first derived by comparing the normalized ground truth future Yi to each normalized library pattern Mk. Similarity scores s(i,k) are converted to probabilities q(i,k) using a Softmax function with temperature τ:

q(i,k) = exp(s(i,k)/τ) / Σk′ exp(s(i,k′)/τ)
Lmode is then the cross-entropy between the model’s predicted mode distribution pi (shown in Eq 13) and this target qi:

Lmode = −Σk q(i,k) log p(i,k)
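A minimal sketch of the soft-target construction and cross-entropy, using plain lists; the temperature value here is illustrative, not the paper's tuned setting.

```python
import math

def soft_mode_targets(similarities, tau=0.1):
    """Convert similarity scores s(i, k) into a soft target distribution
    q(i, k) via a temperature-tau Softmax."""
    exps = [math.exp(s / tau) for s in similarities]
    z = sum(exps)
    return [e / z for e in exps]

def mode_cross_entropy(p, q, eps=1e-12):
    """L_mode: cross-entropy -sum_k q(i, k) * log p(i, k) between the
    predicted mode distribution p and the soft target q."""
    return -sum(qk * math.log(pk + eps) for pk, qk in zip(p, q))
```

A lower temperature sharpens q toward the single best-matching pattern; a higher one spreads credit across several plausible modes.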
Jointly optimizing Lpred and Lmode promotes high-quality, diverse, and contextually appropriate predictions. Algorithm 1 summarizes the complete forward pass of our proposed model.
4 Results
4.1 Experimental setup
4.1.1 Datasets.
We evaluate the LGCMT model on the widely used ETH [25] and UCY [26] benchmarks. These benchmarks comprise five distinct scenes, namely ETH, HOTEL, UNIV, ZARA1, and ZARA2, featuring varied pedestrian densities and interaction patterns. The data consists of 2D coordinates recorded at 2.5 Hz. We adhere to standard evaluation protocols [8,10], observing trajectories for 8 time steps (corresponding to 3.2 seconds, denoted Tobs) and predicting the subsequent 12 steps (covering 4.8 seconds, denoted Tpred). A Leave-One-Out Cross-Validation (LOOCV) strategy is employed for evaluation, where the model is trained on four scenes and tested on the remaining fifth scene, iterating this process so that each scene serves as the test set once. Input trajectory observations undergo normalization: the starting point is translated to the origin, and the trajectory is rotated to align its initial motion direction approximately with the X-axis.
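The normalization step can be illustrated as follows; this is a sketch under the assumption that the initial motion direction is taken from the first observed displacement, which may differ from the exact convention used in the implementation.

```python
import math

def normalize_trajectory(traj):
    """Translate a trajectory so it starts at the origin, then rotate it
    so the initial motion direction lies along the +X axis.

    traj: list of (x, y) coordinates, oldest first."""
    x0, y0 = traj[0]
    shifted = [(x - x0, y - y0) for x, y in traj]
    dx, dy = shifted[1]                 # first motion step defines the heading
    theta = math.atan2(dy, dx)
    c, s = math.cos(-theta), math.sin(-theta)
    # rotate every point by -theta about the origin
    return [(x * c - y * s, x * s + y * c) for x, y in shifted]
```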
In addition, we evaluate our model on the Stanford Drone Dataset (SDD) [27], a large-scale benchmark with higher crowd density and scene complexity. For fair comparison, we adopt the same Tobs = 8 and Tpred = 12 settings.
4.1.2 Evaluation metrics.
Model performance is quantified using the Average Displacement Error (ADE) and Final Displacement Error (FDE). ADE measures the mean L2 distance between the predicted trajectory Ŷ and the ground truth trajectory Y over all predicted time steps:

ADE = (1/Tpred) Σt=1..Tpred ‖Ŷt − Yt‖2

FDE calculates the L2 distance specifically at the final predicted time step Tpred:

FDE = ‖ŶTpred − YTpred‖2
To account for the inherent multi-modality of future trajectories, we follow common practice [5] by generating K potential trajectories and reporting the minimum ADE and FDE among these candidates, denoted minADEK and minFDEK, averaged across all test samples. Unless otherwise specified, we use K = 20. Lower ADE and FDE values indicate better prediction accuracy.
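The two metrics and the best-of-K protocol reduce to a few lines; the following is a self-contained sketch over 2D point lists.

```python
import math

def ade(pred, gt):
    """Average L2 displacement over all predicted time steps."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def fde(pred, gt):
    """L2 displacement at the final predicted time step."""
    return math.dist(pred[-1], gt[-1])

def min_ade_fde(candidates, gt):
    """Best-of-K evaluation: the minimum ADE and FDE over K candidates."""
    return min(ade(c, gt) for c in candidates), min(fde(c, gt) for c in candidates)
```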
4.1.3 Hyperparameter settings.
The proposed LGCMT model was implemented using the PyTorch framework, and all experiments were conducted on an NVIDIA RTX 4070 GPU. The core motion mode library is constructed offline via K-Means clustering. The size of the motion library Nlib was optimized per scene: 50 for ETH, 90 for HOTEL, 50 for UNIV, 70 for ZARA1, and 50 for ZARA2, while Nlib = 100 was used for the SDD. During inference, we select the top K = 20 modes, consistent with the evaluation protocol. Regarding the neighbor selection strategy defined in Section 3.1, we set the spatial distance threshold to 2.0 meters for the ETH/UCY datasets and 5.0 units for the SDD to capture socially significant interactions within the respective spatial scales.
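The threshold-based neighbor selection can be sketched as below; the helper name and data layout are ours for illustration, not from the released code.

```python
import math

def select_neighbors(ego_pos, agent_positions, threshold=2.0):
    """Keep only agents within `threshold` of the ego pedestrian
    (2.0 m for ETH/UCY, 5.0 units for SDD, per the settings above).

    agent_positions: dict agent_id -> (x, y) at the current time step."""
    return [aid for aid, pos in agent_positions.items()
            if math.dist(ego_pos, pos) <= threshold]
```

Restricting attention to this filtered set keeps the social-interaction cost proportional to the number of nearby agents rather than the full scene population.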
Regarding model architecture, the core hidden dimension dmodel was set to 256 for the ETH and HOTEL datasets, 128 for the UNIV, ZARA1, and ZARA2 datasets, and 64 for the SDD. The local-global collaborative encoder employs 3 stacked Transformer blocks for the ETH dataset and 2 blocks for all other datasets (including SDD). All multi-head attention mechanisms within the encoders and decoders utilize 4 attention heads. The local history window size Rwindow for the SCT-MSA module was also tuned specifically for each scene: 4 for ETH and HOTEL, 7 for UNIV and ZARA1, 5 for ZARA2, and 3 for the SDD. For training, we used a batch size of 128 and adhered to the optimization strategy detailed in the Training strategy and loss functions subsection.
4.2 Comparison with existing methods
To rigorously evaluate the proposed LGCMT framework, we compare it against a comprehensive set of contemporary methods detailed in Table 1. These include Social-STGCNN [16], STAR [6], PECNet [28], AgentFormer [5], SGCN [15], SIT [29], MemoNet [30], SocialVAE [21], TUTR [24], BCDiff [7], FlowChain [23], SMEMO [4], Social NSTransformers [32], TP-EGT [18], TPPO [31], Social Entropy Informer [33], Social Informer [34], and W-DGTrans [35].
4.2.1 Performance on ETH/UCY datasets.
On the standard ETH and UCY benchmarks, LGCMT demonstrates robust performance, achieving the lowest average errors across all compared methods with a minADE of 0.20 and a minFDE of 0.34. This represents a substantial improvement over earlier baselines and a competitive edge over the most recent approaches.
Specifically, we compare our method against the recent works of TP-EGT [18] and TPPO [31] as highlighted in recent literature. LGCMT outperforms the graph-transformer-based TP-EGT (Average ADE 0.23) by approximately 13.0% and significantly surpasses the pseudo-oracle-based TPPO (Average ADE 0.39) with a reduction in error of roughly 48.7%. Furthermore, compared to the 2026 baseline W-DGTrans [35], which reports an average ADE of 0.21 meters, our model maintains a performance advantage, particularly in the HOTEL and ZARA2 scenes where distinct motion patterns and social interactions are prevalent.
The breakdown by scene reveals that LGCMT achieves the best reported ADE in the HOTEL (0.11), UNIV (0.23), and ZARA2 (0.14) subsets. This consistency across datasets with varying pedestrian densities validates the effectiveness of the local-global collaborative encoder. By simultaneously capturing the fine-grained local dynamics and the long-term individual intent, the model effectively mitigates the trade-off often observed in other methods that may overfit to specific scene types.
4.2.2 Performance on SDD dataset.
To address the limitations associated with the relatively small scale of the ETH/UCY datasets and to test the model’s scalability in high-density, real-world environments, we extended our evaluation to the SDD [27]. As shown in Table 2, SDD presents significantly greater challenges due to its bird’s-eye view, diverse agent types (including cyclists and skaters), and complex static obstacles.
In this rigorous benchmark, LGCMT achieves a minADE of 7.90 pixels and a minFDE of 13.04 pixels. These results surpass those of competitive baselines, including TUTR [24] (7.99 pixels ADE) and SMEMO [4] (8.11 pixels ADE). The superior performance on SDD is particularly noteworthy as it confirms that the proposed spatial neighbor filtering strategy allows the model to scale efficiently to crowded scenes. Unlike global attention mechanisms that may suffer from noise accumulation when processing dozens of agents, our method maintains precision by focusing on socially relevant neighbors, thereby demonstrating strong robustness and generalization capabilities in complex, unstructured environments.
4.3 Ablation study
To rigorously validate the architectural choices underpinning the LGCMT model and quantify their individual contributions, we conducted comprehensive ablation studies on the ETH/UCY datasets. These experiments assessed the impact of removing or altering key components on both predictive accuracy, measured by average minimum ADE and FDE over 20 samples in meters, and computational efficiency via average inference time in milliseconds. All ablation experiments maintained the primary experimental setup on an NVIDIA RTX 4070 GPU.
4.3.1 Component effectiveness analysis.
We first investigated the contribution of LGCMT’s primary architectural modules by systematically removing or modifying them. The results, detailed in Table 3, reveal the significance of each design decision.
The necessity of the local-global collaborative encoding strategy is immediately apparent. Removing the global context encoder, hereafter referred to as GCE, led to a substantial performance decline; average ADE increased by 20.0% to 0.24 and average FDE rose by 20.6% to 0.41. This underscores the critical role of the GCE in capturing broader trajectory trends and inferring longer-term pedestrian intent. Similarly, ablating the causal temporal encoder, hereafter CTE, also hampered predictive accuracy, elevating average ADE by 10.0% to 0.22 and FDE by 11.8% to 0.38. While this impact was less severe than removing the global context, it confirms the importance of the CTE’s fine-grained local motion modeling. These findings collectively indicate that relying solely on a single temporal scale is insufficient; LGCMT’s strength derives from synergistically integrating local dynamics captured by the CTE with global motion understanding provided by the GCE.
The experiments further highlight the profound impact of the prediction guidance mechanism. The most significant performance degradation occurred when the motion mode library was removed. Without this guidance, average ADE surged by 60.0% to 0.32, and FDE increased dramatically by 70.6% to 0.58. This starkly illustrates that generating structured multi-mode hypotheses, informed by learned motion patterns, is fundamental to the model’s accuracy and its capacity for diverse, plausible predictions. Additionally, neglecting social interactions proved detrimental. Removing social context modeling resulted in a 40.0% increase in average ADE to 0.28 and a 47.1% increase in average FDE to 0.50, confirming that accounting for the influence of nearby pedestrians remains crucial for realistic trajectory forecasting, particularly in interactive settings.
Finally, the specific choice of attention mechanisms within the encoders was validated. Replacing the specialized Sparse Causal Temporal Multi-head Self-Attention, known as SCT-MSA, in the CTE with standard self-attention resulted in a noticeable drop in performance, yielding an average ADE/FDE of 0.21/0.37. A similar decline to 0.22/0.37 was observed when the GCE’s cosine similarity attention was substituted with standard dot-product attention. These outcomes affirm that the tailored designs of SCT-MSA for local temporal dependencies and cosine similarity attention for global directional trends are more effective within their respective LGCMT modules than generic attention approaches.
4.3.2 Complexity and inference speed analysis.
Beyond predictive accuracy, the practical utility of a forecasting model relies heavily on its computational efficiency. To comprehensively evaluate the real-time performance of the algorithm, we analyzed three key indicators: model parameters (Params), computational complexity (FLOPs), and actual Inference Time.
To ensure a fair and rigorous comparison, we established a unified evaluation protocol. All models listed in Table 4 were re-evaluated on a workstation equipped with an NVIDIA GeForce RTX 4070 GPU. We strictly followed the standard real-time latency evaluation criteria: the inference time was measured with a batch size of 1 and a sampling number of K = 20. Furthermore, all reported data are the average of measurements taken after a warm-up period to exclude system initialization noise and random fluctuations. The values are averaged across the five ETH/UCY datasets to mitigate biases from specific scene characteristics.
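The warm-up-then-average protocol can be sketched as follows. This is a CPU-side illustration only: `model_fn` is a placeholder for the forward pass, and on a GPU a device synchronization before each clock read would also be required.

```python
import time

def measure_latency(model_fn, inputs, warmup=10, runs=100):
    """Average single-sample inference latency in milliseconds,
    measured after a warm-up phase (batch size 1, averaged runs)."""
    for _ in range(warmup):           # exclude initialization noise
        model_fn(inputs)
    t0 = time.perf_counter()
    for _ in range(runs):
        model_fn(inputs)
    return (time.perf_counter() - t0) / runs * 1000.0
```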
It is worth noting that the inference time for the same algorithm varies across different scenarios. This variance is primarily attributed to crowd density. Most interaction-aware models, including LGCMT, employ mechanisms where the computational cost scales with the number of agents in the frame. Consequently, densely populated scenarios (such as UNIV) naturally incur slightly higher latency compared to sparse scenarios (such as ETH).
As shown in Table 4, LGCMT achieves an optimal balance between accuracy and efficiency (2.79 ms). When compared to complex Transformer-based architectures, our model demonstrates a significant speed advantage. For instance, MemoNet (156.96 ms) involves processing continuous features alongside high-overhead retrieval operations from an external memory bank, which inherently limits its inference speed. Similarly, AgentFormer (46.58 ms) requires heavy computation for its dense attention mechanisms. Regarding SocialVAE, while it achieves competitive speed (16.27 ms), this metric is recorded without its “Final Position Clustering (FPC)” post-processing. Although FPC can improve precision, it increases latency to approximately 2.8 seconds per scene, rendering it unsuitable for real-time applications. While lightweight baselines like Social-STGCNN (1.59 ms) and TUTR (1.89 ms) are marginally faster, LGCMT maintains a comparable millisecond-level speed while offering more robust trajectory modeling capabilities. This analysis confirms that LGCMT is well-suited for deployment in dynamic environments where both accuracy and low latency are critical.
Impact of Hidden Dimensions: To further explore the scalability and the trade-off between model capacity and efficiency, we conducted an ablation study on the SDD dataset by varying the hidden dimension size (H). As detailed in Table 5, increasing the hidden dimension from 16 to 256 leads to a substantial increase in computational cost: parameters grow from 0.08 M to 2.47 M, and FLOPs double from 0.13 G to 0.26 G. However, this increase in complexity does not strictly correlate with performance gains. The model achieves its best predictive accuracy (ADE = 7.90) at H = 64, with an inference time of just 3.02 ms. Interestingly, larger dimensions result in slight performance degradation (ADE = 8.12), likely due to overfitting on the trajectory data. Conversely, an extremely small dimension (H = 16) limits the model’s representational power, leading to higher errors. Consequently, we selected H = 64 as the optimal configuration for our main experiments, as it minimizes computational overhead while maximizing prediction accuracy.
4.3.3 Robustness analysis.
To verify the stability and reproducibility of LGCMT, we conducted repeated experiments using 5 different random seeds on both the ETH/UCY and SDD datasets. Table 6 reports the statistics in the format of Mean ± Standard Deviation (Std). While the main results in Table 1 report the best-performing model to ensure a fair comparison with baselines, the results here demonstrate that the deviation between the mean performance and the best run is minimal. For instance, on the challenging ETH scene, the mean ADE is 0.37m compared to the best run of 0.36m. The extremely low standard deviations (e.g., ±0.001 on HOTEL and UNIV) confirm that LGCMT is robust to initialization and achieves consistent performance.
4.3.4 Impact of non-autoregressive decoding.
To isolate the contribution of our chosen decoding strategy, we explicitly compared the standard non-autoregressive (NAR) LGCMT against an autoregressive (AR) counterpart. This baseline, termed LGCMT (AR), utilized the identical encoder architecture but employed a traditional sequential decoder. Table 7 presents the comparison regarding both predictive accuracy, ADE/FDE, and inference time.
The advantages of the NAR approach are clear. Regarding predictive accuracy, the NAR model significantly outperformed its AR variant. Average ADE decreased by 25.9% from 0.27 to 0.20, and average FDE saw a 17.1% reduction from 0.41 to 0.34. This accuracy enhancement likely arises from the NAR mechanism generating the full sequence concurrently, mitigating the error accumulation often problematic in step-by-step autoregressive predictions.
The efficiency benefits are even more striking. The NAR-based LGCMT required only 2.79 milliseconds for inference on average. This is approximately 31 times faster than the 87.55 milliseconds needed by the LGCMT (AR) model. This dramatic speed-up is a direct result of the NAR decoder’s parallel computation across all future time steps, contrasting sharply with the inherent sequential processing of AR decoding. This finding strongly advocates for the NAR strategy in applications demanding both high accuracy and rapid response times.
4.3.5 Key hyperparameter sensitivity analysis.
We further investigated how LGCMT’s performance responds to variations in two key hyperparameters: the motion mode library size Nlib, and the local history window size Rwindow used within the SCT-MSA module. Our primary results employed scene-specific optimal values for Nlib, ranging from 50 to 90, and for Rwindow, ranging from 4 to 7, achieving the benchmark 0.20/0.34 average ADE/FDE. The following analysis explores performance when these hyperparameters are set uniformly across all scenes.
First, varying the motion mode library size Nlib uniformly across values {30, 50, 70, 90, 110} revealed performance trends depicted in Fig 4. A small library size where Nlib equals 30 yielded a noticeable drop to 0.23/0.40 average ADE/FDE, suggesting insufficient capacity to capture motion diversity. Performance stabilized around 0.21 average ADE and 0.35–0.36 average FDE for Nlib between 50 and 90. Increasing Nlib further to 110 resulted in a slight degradation to 0.22/0.37. This indicates that while a sufficiently large library is crucial, an excessively large Nlib offers diminishing returns and can slightly dilute performance, reinforcing the benefit of scene-specific tuning.
Bars report the average minADE/minFDE when a single Nlib is used for all scenes. Horizontal dashed lines denote the baseline obtained with scene-specific optimal Nlib. Performance is stable for Nlib between 50 and 90, while per-scene tuning achieves the best accuracy.
Next, evaluating the influence of the local history window size Rwindow with uniform values from {2, 3, 4, 5, 6, 7}, as shown in Fig 5, indicated greater robustness compared to variations in Nlib. Across the tested range, average ADE remained between 0.21 and 0.22, and average FDE between 0.35 and 0.37. Settings such as Rwindow = 4 or Rwindow = 5 produced a solid 0.21/0.35 average ADE/FDE. Even extreme values did not cause sharp performance drops. This suggests LGCMT is relatively insensitive to the exact local window size, although optimal scene-specific selection, as used in our main experiments, can provide marginal gains, further validating our adaptive configuration strategy.
Bars report the average minADE/minFDE when a single Rwindow is used for all scenes. Horizontal dashed lines denote the baseline obtained with scene-specific optimal Rwindow. The model is robust to Rwindow, with stable performance around Rwindow = 4–5.
4.3.6 Visualization.
We complement our quantitative evaluation with qualitative visualizations in real-world scenarios selected from the ETH and UCY test sets.
Fig 6 showcases the diversity of predictions generated by LGCMT (K = 20). The visualization confirms that our library-guided approach can hypothesize varied yet plausible outcomes, effectively covering the ground truth. This ability to model distribution spread is crucial for capturing motion uncertainty in dynamic environments.
(A) ETH, (B) HOTEL, (C) UNIV, (D) ZARA1, and (E) ZARA2. Observed trajectories are shown in green, ground-truth futures in blue, predicted trajectories are shown as red dashed lines, and the best prediction is highlighted in solid red.
Furthermore, Fig 7 compares the best-predicted trajectory of LGCMT against the TUTR baseline [24]. In scenarios requiring complex maneuvers, such as navigating through crowds or approaching destinations, LGCMT demonstrates stronger adherence to the ground truth. Unlike the baseline, which may struggle with sudden directional changes, our model effectively captures fine-grained dynamics. Collectively, these visualizations validate that the proposed Local-Global Collaborative Encoder and NAR decoder successfully capture intricate pedestrian dynamics.
(A) ETH, (B) HOTEL, (C) UNIV, and (D) ZARA. Observations are shown in green and ground truth in blue. LGCMT is shown in red and the baseline is shown in orange.
5 Discussion
The experimental results presented offer critical insights into the mechanisms underpinning LGCMT’s performance. The model’s effectiveness is rooted in the synergistic collaboration of its architectural components, which balance historical interpretation, structured guidance, and operational efficiency.
The ablation studies confirm that accurate prediction requires disentangling multi-scale temporal dynamics. The significant performance degradation observed when removing either the Global Context Encoder or the Causal Temporal Encoder underscores that neither short-term kinematics nor long-term intent is sufficient in isolation. By integrating these through specialized attention mechanisms (SCT-MSA and Cosine Similarity), LGCMT effectively captures the duality of pedestrian motion—reacting to immediate surroundings while maintaining a consistent destination.
Furthermore, the motion mode library proves to be a cornerstone for ensuring predictive diversity. By constraining trajectory generation to a learned set of behavioral patterns, the model effectively mitigates mode collapse and unrealistic path generation. This structured guidance, coupled with explicit social interaction modeling, ensures that predictions are not only diverse but also socially compliant. The strategic choice of the Library-Guided NAR decoder is further validated by the efficiency analysis. By eliminating the sequential bottleneck of autoregressive models, LGCMT achieves a dramatic inference speedup (approximately 30×) while maintaining high accuracy, confirming its suitability for latency-critical real-world applications.
Despite its demonstrated effectiveness, LGCMT offers several clear directions for further improvement. First, the reliance on a pre-constructed motion library means generalization is tied to the diversity of the training data. Future work could explore online adaptation or dynamic mode discovery to strengthen robustness under out-of-distribution behaviors.
Second, in terms of environmental context, the current framework relies solely on trajectory data and therefore does not explicitly model map semantics or obstacle geometry. Although the model can indirectly infer walkable regions from agents’ historical behaviors, it lacks explicit physical grounding to ensure collision-free predictions when facing complex static structures in highly organized scenes. This choice was made to prioritize computational efficiency and emphasize dynamic social interactions. Nonetheless, incorporating a lightweight, scene-centric branch to process semantic maps or occupancy grids would be a natural next step. Such an extension could improve generalization in navigation-intensive environments while largely preserving the inference efficiency of our architecture.
Finally, expanding the evaluation to broader domains is a promising direction. Beyond pedestrian dynamics, complex multi-agent interactions are prevalent in sports analytics, such as the NBA dataset. While distinct in input modality and team strategies, such scenarios share the need for modeling coupled spatiotemporal behaviors. Adapting LGCMT to handle these domain-specific constraints would be a valuable step toward testing the cross-domain generalizability of our local-global and library-guided approach.
6 Conclusion
This paper presented LGCMT, an efficient pedestrian trajectory prediction framework that adeptly captures the complex, multi-modal nature of human movement. The core strength of LGCMT lies in its innovative local-global collaborative encoder, which synergistically employs sparse causal temporal attention for local dynamics and cosine similarity attention for global patterns to construct a comprehensive historical representation. To address predictive diversity, we introduced a structured hypothesis mechanism guided by a motion mode library, ensuring the generation of varied yet plausible futures. By integrating social interaction modeling with an efficient non-autoregressive parallel decoder, LGCMT not only achieves competitive accuracy on the standard ETH/UCY benchmarks but also demonstrates robust scalability on the challenging SDD. These results confirm that LGCMT offers a compelling balance of performance and efficiency, making it highly suitable for practical deployment in real-world applications.
References
- 1. Chen X, Zhang H, Deng F, Liang J, Yang J. Stochastic Non-Autoregressive Transformer-Based Multi-Modal Pedestrian Trajectory Prediction for Intelligent Vehicles. IEEE Trans Intell Transport Syst. 2024;25(5):3561–74.
- 2. Lian J, Yu F, Li L, Zhou Y. Causal Temporal–Spatial Pedestrian Trajectory Prediction With Goal Point Estimation and Contextual Interaction. IEEE Trans Intell Transport Syst. 2022;23(12):24499–509.
- 3. Yang C, Pan H, Sun W, Gao H. Social Self-Attention Generative Adversarial Networks for Human Trajectory Prediction. IEEE Trans Artif Intell. 2024;5(4):1805–15.
- 4. Marchetti F, Becattini F, Seidenari L, Bimbo AD. SMEMO: Social Memory for Trajectory Forecasting. IEEE Trans Pattern Anal Mach Intell. 2024;46(6):4410–25. pmid:38252585
- 5. Yuan Y, Weng X, Ou Y, Kitani K. AgentFormer: Agent-Aware Transformers for Socio-Temporal Multi-Agent Forecasting. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021. 9793–803. https://doi.org/10.1109/iccv48922.2021.00967
- 6. Yu C, Ma X, Ren J, Zhao H, Yi S. Spatio-Temporal Graph Transformer Networks for Pedestrian Trajectory Prediction. Lecture Notes in Computer Science. Springer International Publishing. 2020. 507–23. https://doi.org/10.1007/978-3-030-58610-2_30
- 7. Chen G, Li C, Li R, Ren D, Wang G, Yuan Y. BCDiff: Bidirectional Consistent Diffusion for Instantaneous Trajectory Prediction. In: Advances in Neural Information Processing Systems 36, 2023. 14400–13. https://doi.org/10.52202/075280-0633
- 8. Gupta A, Johnson J, Fei-Fei L, Savarese S, Alahi A. Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018. 2255–64. https://doi.org/10.1109/cvpr.2018.00240
- 9. Giuliari F, Hasan I, Cristani M, Galasso F. Transformer Networks for Trajectory Forecasting. In: 2020 25th International Conference on Pattern Recognition (ICPR), 2021. 10335–42. https://doi.org/10.1109/icpr48806.2021.9412190
- 10. Alahi A, Goel K, Ramanathan V, Robicquet A, Fei-Fei L, Savarese S. Social LSTM: Human Trajectory Prediction in Crowded Spaces. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 961–71. https://doi.org/10.1109/cvpr.2016.110
- 11. Sadeghian A, Kosaraju V, Sadeghian A, Hirose N, Rezatofighi H, Savarese S. SoPhie: An Attentive GAN for Predicting Paths Compliant to Social and Physical Constraints. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 1349–58. https://doi.org/10.1109/cvpr.2019.00144
- 12. Shafiee N, Padir T, Elhamifar E. Introvert: Human Trajectory Prediction via Conditional 3D Attention. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. 16810–20. https://doi.org/10.1109/cvpr46437.2021.01654
- 13. Dong Y, Wang L, Zhou S, Tang W, Hua G, Sun C. AFC-RNN: Adaptive Forgetting-Controlled Recurrent Neural Network for Pedestrian Trajectory Prediction. IEEE Trans Pattern Anal Mach Intell. 2025;47(11):10177–91. pmid:40742860
- 14. Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. In: Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France, 2017. 1–14. https://openreview.net/forum?id=SJU4ayYgl
- 15. Shi L, Wang L, Long C, Zhou S, Zhou M, Niu Z, et al. SGCN: Sparse Graph Convolution Network for Pedestrian Trajectory Prediction. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. 8990–9. https://doi.org/10.1109/cvpr46437.2021.00888
- 16. Mohamed A, Qian K, Elhoseiny M, Claudel C. Social-STGCNN: A Social Spatio-Temporal Graph Convolutional Neural Network for Human Trajectory Prediction. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 14412–20. https://doi.org/10.1109/cvpr42600.2020.01443
- 17. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 2017. 6000–10.
- 18. Yang B, Fan F, Ni R, Wang H, Jafaripournimchahi A, Hu H. A Multi-Task Learning Network With a Collision-Aware Graph Transformer for Traffic-Agents Trajectory Prediction. IEEE Trans Intell Transport Syst. 2024;25(7):6677–90.
- 19. Ni R, Lu S, Hu C, Yang B. Adaptive Progressive Transformer-Based Trajectory Prediction Under Fine-Grained Trajectory-Scene Interaction Constraint. IEEE Trans Automat Sci Eng. 2025;22:24498–509.
- 20. Lee N, Choi W, Vernaza P, Choy CB, Torr PHS, Chandraker M. DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 2165–74. https://doi.org/10.1109/cvpr.2017.233
- 21. Xu P, Hayet J-B, Karamouzas I. SocialVAE: Human Trajectory Prediction Using Timewise Latents. Lecture Notes in Computer Science. Springer Nature Switzerland. 2022. 511–28. https://doi.org/10.1007/978-3-031-19772-7_30
- 22. Gu T, Chen G, Li J, Lin C, Rao Y, Zhou J, et al. Stochastic Trajectory Prediction via Motion Indeterminacy Diffusion. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 17092–101. https://doi.org/10.1109/cvpr52688.2022.01660
- 23. Maeda T, Ukita N. Fast Inference and Update of Probabilistic Density Estimation on Trajectory Prediction. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 9761–71. https://doi.org/10.1109/iccv51070.2023.00898
- 24. Shi L, Wang L, Zhou S, Hua G. Trajectory Unified Transformer for Pedestrian Trajectory Prediction. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 9641–50. https://doi.org/10.1109/iccv51070.2023.00887
- 25. Pellegrini S, Ess A, Schindler K, van Gool L. You’ll never walk alone: Modeling social behavior for multi-target tracking. In: 2009 IEEE 12th International Conference on Computer Vision, 2009. 261–8. https://doi.org/10.1109/iccv.2009.5459260
- 26. Lerner A, Chrysanthou Y, Lischinski D. Crowds by Example. Computer Graphics Forum. 2007;26(3):655–64.
- 27. Robicquet A, Sadeghian A, Alahi A, Savarese S. Learning social etiquette: Human trajectory understanding in crowded scenes. In: Computer Vision – ECCV 2016. vol. 9912. Amsterdam, The Netherlands: Springer. 2016. 549–65.
- 28. Mangalam K, Girase H, Agarwal S, Lee KH, Adeli E, Malik J. It is not the journey but the destination: Endpoint conditioned trajectory prediction. In: Computer Vision – ECCV 2020: 16th European Conference, Part II, Glasgow, UK, 2020. 759–76.
- 29. Shi L, Wang L, Long C, Zhou S, Zheng F, Zheng N, et al. Social Interpretable Tree for Pedestrian Trajectory Prediction. AAAI. 2022;36(2):2235–43.
- 30. Xu C, Mao W, Zhang W, Chen S. Remember Intentions: Retrospective-Memory-based Trajectory Prediction. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 6478–87. https://doi.org/10.1109/cvpr52688.2022.00638
- 31. Yang B, He C, Wang P, Chan C-Y, Liu X, Chen Y. TPPO: A Novel Trajectory Predictor With Pseudo Oracle. IEEE Trans Syst Man Cybern, Syst. 2024;54(5):2846–59.
- 32. Jiang Z, Ma Y, Shi B, Lu X, Xing J, Gonçalves N. Social NSTransformers: Low-quality pedestrian trajectory prediction. IEEE Trans Artif Intell. 2024;5:5575–88.
- 33. Jiang Z, Qin C, Yang R, Shi B, Alsaadi FE, Wang Z. Social Entropy Informer: A Multi-Scale Model-Data Dual-Driven Approach for Pedestrian Trajectory Prediction. IEEE Trans Intell Transport Syst. 2025;26(10):16438–53.
- 34. Jiang Z, Yang R, Ma Y, Qin C, Chen X, Wang Z. Social Informer: Pedestrian Trajectory Prediction by Informer With Adaptive Trajectory Probability Region Optimization. IEEE Trans Cybern. 2026;56(1):15–28. pmid:41052188
- 35. Wen Y, Li Z, Xu P. Dynamic graph transformer for pedestrian potential trajectory prediction under the world perspective. Neurocomputing. 2026;664:132125.