Abstract
Pedestrian trajectory prediction is crucial for autonomous vehicles, yet it faces challenges in integrating complex spatiotemporal dynamics, managing multi-modal future behaviors, and ensuring real-time performance. This paper introduces the Local-Global Collaborative Transformer Network (LGCMT) to address these issues. LGCMT features an innovative local-global collaborative encoder comprising two key modules: a Sparse Causal Temporal Attention (SCT-MSA) module, designed to extract fine-grained local causal dynamics, and a Global Context Encoder that utilizes Cosine Similarity Attention to capture macro-level spatiotemporal patterns. For multi-modal prediction, LGCMT employs a parallel Non-Autoregressive (NAR) decoder guided by a motion pattern library, which efficiently generates diverse trajectory candidates covering key future likelihoods. Extensive evaluations on the standard ETH/UCY benchmarks and the large-scale Stanford Drone Dataset (SDD) demonstrate LGCMT’s robust performance. On ETH/UCY, the model improves ADE and FDE by approximately 4.8% and 5.6% compared to the competitive TUTR baseline. Moreover, the proposed framework achieves exceptional inference efficiency, establishing LGCMT as a potent solution that effectively balances accuracy, multi-modality, and operational speed for real-time applications.
Citation: Gong S, Bao Y, Hou Y, Lu W, Shi Q (2026) Local causal dynamic integrated global mode guidance transformer network for pedestrian trajectory prediction. PLoS One 21(4): e0347049. https://doi.org/10.1371/journal.pone.0347049
Editor: Farman Ullah, UAEU: United Arab Emirates University, UNITED ARAB EMIRATES
Received: July 14, 2025; Accepted: March 26, 2026; Published: April 20, 2026
Copyright: © 2026 Gong et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant results are reported within the paper. The author-generated code underlying the findings, together with selected pre-trained model weights and documentation for installation and use, is openly available on GitHub at https://github.com/NTU24pg/LGCMT under the MIT License. The pre-processed datasets used in this study, including ETH, UCY, and SDD, are publicly available on Zenodo (DOI: https://doi.org/10.5281/zenodo.15691159) to ensure long-term accessibility. The original ETH, UCY, and SDD benchmark datasets are publicly available from their respective original sources.
Funding: This work was supported by the National Natural Science Foundation of China (Grant 62476145); the Humanity and Social Science Foundation of Ministry of Education of China (Grant 24YJAZH126); the 6th ‘333 Talents’ Technology Research and Development Talent Foundation of Jiangsu Province; the Transportation Technology and Achievement Transformation Foundation of Jiangsu Province (Grant 2024G01); the Key Laboratory of Target Cognition and Application Technology (Grant 2023-CXPTLC-005); and the Postgraduate Research & Practice Innovation Program of Jiangsu Province (Grant SJCX25_2019). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. No authors received a salary from any of the funders.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Accurate and efficient pedestrian trajectory prediction is essential for safe autonomous navigation [1,2] and effective human-robot interaction [3]. In dynamic environments, decision-making systems must anticipate future states by leveraging historical motion patterns and social cues [4]. Although deep learning has substantially advanced predictive performance, real-world deployment still hinges on addressing a key trade-off: achieving strong representational capacity to capture complex spatiotemporal dynamics while maintaining inference efficiency to satisfy real-time latency requirements.
Modeling pedestrian motion faces two structural challenges regarding feature representation and output generation. First, pedestrian movement is governed by two distinct temporal scales: local causal dynamics, representing immediate kinematic reactions to surroundings, and global motion trends, representing consistent long-term directionality. Existing architectures often struggle to balance these. For instance, Transformer-based models like AgentFormer [5] utilize dense self-attention across the full sequence, which entangles local and global cues within a unified attention map rather than explicitly separating them. Conversely, methods like STAR [6] process spatial and temporal dimensions via separate stages. While this design is structured, it lacks an explicit mechanism to disentangle multi-scale temporal dynamics within its processing pipeline.
Second, the inherent multimodality of human behavior—where a single history can lead to multiple plausible futures—demands modeling a distribution of possible trajectories. As illustrated in Fig 1, social interactions create diverse plausible paths from the same observed history. However, generating these hypotheses efficiently remains difficult. Generative approaches based on diffusion models [7] or GANs [8] often incur high computational overhead due to iterative denoising or complex sampling. Among Transformer-based methods, high-fidelity approaches frequently rely on autoregressive (AR) decoding [5,9], which requires sequential forward passes proportional to the prediction horizon Tpred. This linear scaling of latency with forecast length poses a significant challenge for safety-critical, real-time applications.
Given the same observed history, different interaction outcomes can lead to multiple plausible future trajectories (red dashed), while only one is realized as the ground truth (blue). This illustration motivates the need to predict multiple modes to capture the eventual outcome.
To address these limitations, we propose the LGCMT, a framework designed to balance structural efficiency with predictive diversity. LGCMT introduces a dual-branch encoder to explicitly model the hierarchy of motion: a SCT-MSA branch captures short-term, history-dependent kinematic adjustments, while a Cosine Similarity Attention branch extracts macro-level directional trends. To ensure scalability in crowded scenes, we incorporate a distance-based spatial filtering strategy that reduces interaction complexity from quadratic to linear.
For decoding, we replace the standard autoregressive loop with a Library-Guided Non-Autoregressive (NAR) decoder. By retrieving structured motion patterns from a library learned on training trajectories, the model generates a diverse set of candidates in a single decoding pass, reducing the number of forward passes from Tpred to 1.
Our specific contributions are:
- Local-Global Collaborative Encoder: We propose a specialized encoder that disentangles local causal dynamics from global motion trends using parallel sparse-causal and cosine-similarity attention mechanisms. This design explicitly separates immediate reactions from long-term directionality at the attention level.
- Library-Guided NAR Decoding: We introduce a parallel prediction framework that utilizes a learned motion pattern library to guide non-autoregressive generation. This approach eliminates the latency of sequential decoding and avoids the iterative overhead associated with diffusion-style models.
- Efficiency and Accuracy Balance: Validated on ETH/UCY and the large-scale SDD benchmarks, LGCMT achieves competitive accuracy compared to recent baselines. It demonstrates exceptional inference speed (approximately 2.8 ms per sample), making it highly suitable for real-time deployment in complex environments.
2 Related work
This section reviews existing literature focusing on two critical dimensions: spatiotemporal representation learning and the trade-off between multimodality and inference efficiency.
2.1 Spatiotemporal representation and interaction modeling
Early data-driven approaches utilized Recurrent Neural Networks (RNNs) to model sequential dependencies. Social-LSTM [10] introduced social pooling to aggregate neighbor information, a concept refined by subsequent attention-based RNNs [11,12]. While Transformers have gained traction, RNN architectures continue to evolve; for instance, the recent AFC-RNN [13] incorporates adaptive forgetting controllers to explicitly manage historical redundancy, demonstrating the continued relevance of recurrent structures for temporal modeling.
Graph Neural Networks (GNNs) [14] offer a flexible topology for modeling interactions, treating agents as nodes. Approaches like SGCN [15] and Social-STGCNN [16] leverage sparse graph convolutions to capture social effects. However, GNNs typically prioritize spatial topology, often employing simpler temporal aggregation mechanisms compared to attention-based sequence models.
The Transformer architecture [17] addresses long-range dependencies via self-attention. Early adaptations like STAR [6] interleaved spatial graphs with temporal Transformers. More recent works focus on enhancing robustness via specific constraints or additional modules. For example, TP-EGT [18] introduces a collision-aware Graph Transformer within a multi-task framework to explicitly predict interaction probabilities. Similarly, APT-TP [19] utilizes semantic maps and inverse reinforcement learning to enforce fine-grained trajectory-scene consistency. These constraint-aware Transformers primarily add supervision signals, modules, or extra inputs on top of Transformer backbones, instead of explicitly separating local causal dynamics and global trends at the attention level. In contrast, our LGCMT focuses on the intrinsic efficiency of the attention structure itself, employing sparse causal attention to strictly model the arrow of time for local dynamics, distinct from global trend analysis.
2.2 Multimodality and inference efficiency
Pedestrian trajectory prediction is inherently multimodal. Generative Models handle this by learning latent distributions. GANs [3,8] and CVAEs [20,21] sample from latent noise to generate diverse paths. Recently, Diffusion Models [7,22] have achieved high fidelity in distribution modeling but typically require multiple reverse steps for denoising, increasing inference cost. Normalizing Flows [23] offer exact likelihood estimation but often involve complex invertible transformations.
Deterministic Multi-Hypothesis approaches offer an alternative. Many Transformer-based models, such as AgentFormer [5], employ Autoregressive (AR) decoding to generate multimodal distributions. While AR ensures temporal coherence, it requires Tpred sequential forward passes, creating a latency bottleneck. Non-Autoregressive (NAR) approaches, such as query-based set prediction methods like TUTR [24], attempt to generate all time steps simultaneously. LGCMT extends this NAR paradigm by conditioning predictions on a discrete, learned motion pattern library. This structured guidance aims to ensure trajectory plausibility without the computational overhead of iterative sampling or sequential decoding loops.
3 Materials and methods
This section elaborates on the LGCMT, a model developed to address key challenges in pedestrian trajectory prediction. The proposed framework integrates specialized encoding, multi-mode hypothesis generation, and parallel decoding mechanisms. Subsequent subsections detail the problem formulation, the overall architecture, the design of each core component, and the training methodology.
3.1 Problem formulation and model overview
The objective of pedestrian trajectory prediction is defined as follows. Given the observed historical position sequence for a target pedestrian i over the past Tobs time steps, denoted as $X_i = \{p_i^t\}_{t=1}^{T_{obs}}$ with $p_i^t \in \mathbb{R}^2$, and the historical trajectories Xj of neighboring pedestrians $j \in \mathcal{N}_i$, the goal is to forecast the target's future motion. To efficiently handle crowded scenes and filter out irrelevant interactions, we explicitly define the neighbor set $\mathcal{N}_i$ based on a fixed spatial radius rather than considering all pedestrians in the scene. Specifically, a pedestrian j is included in $\mathcal{N}_i$ if and only if their distance to the target i is within a threshold $d_{th}$:

$$\mathcal{N}_i = \{\, j \in \mathcal{P} \setminus \{i\} : \lVert p_j^{T_{obs}} - p_i^{T_{obs}} \rVert_2 \le d_{th} \,\}$$

where $\mathcal{P}$ denotes the set of all pedestrians in the scene. This distance-based filtering strategy effectively reduces the computational complexity from quadratic in the crowd size to linear in the number of relevant neighbors, preventing noise from distant agents and ensuring scalability in dense environments.

The task is to predict a set of K plausible future trajectories for pedestrian i over the next Tpred time steps. This output set is represented as $\mathcal{Y}_i = \{Y_i^{(1)}, \ldots, Y_i^{(K)}\}$, where each individual trajectory hypothesis $Y_i^{(m)} = \{\hat{p}_i^t\}_{t=T_{obs}+1}^{T_{obs}+T_{pred}}$ signifies a distinct future path. This formulation inherently accommodates the multi-mode nature of pedestrian movement.
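The distance-based neighbor filtering above can be sketched in a few lines. This is a minimal numpy illustration; the function name and the threshold value are ours, not from the paper.

```python
import numpy as np

def neighbor_set(positions, target_idx, d_th):
    """Return indices of pedestrians within d_th of the target
    at the last observed time step (distance-based filtering)."""
    target = positions[target_idx]
    dists = np.linalg.norm(positions - target, axis=1)
    mask = dists <= d_th
    mask[target_idx] = False  # exclude the target itself
    return np.flatnonzero(mask)

# Last observed 2D positions of four pedestrians; target is index 0.
pos = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.0, 2.0]])
print(neighbor_set(pos, 0, d_th=3.0))  # -> [1 3]
```

The cost of this step is linear in the number of pedestrians, and the resulting index set bounds all later social-attention computations.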
Fig 2 illustrates the LGCMT architecture. Input 2D coordinates for the target pedestrian Xi and its neighbors are first embedded into high-dimensional features. The target’s features are then processed by our local-global collaborative encoder (described in the Pedestrian history encoding: A local-global collaborative approach subsection). This encoder has two parallel branches: a Causal Temporal Encoder (CTE) with SCT-MSA to capture local dynamics, and a Global Context Encoder (GCE) with cosine similarity attention for global patterns. Their fused output, $H_i$, represents the target’s history. Correspondingly, the historical trajectories of neighbors Xj are processed through a dedicated embedding layer to obtain their feature representations Hj (see the Socially-aware parallel trajectory decoding subsection for details). Next, a pattern scoring module (CLS Head, as detailed in the Structured multi-mode hypothesis generation via motion pattern library subsection) compares $H_i$ against a pre-constructed motion pattern library $\mathcal{M}$, selecting the top-K motion patterns and their embeddings $\{E_m\}_{m=1}^{K}$. Finally, for each selected pattern, a socially-aware non-autoregressive decoding process (detailed in the Socially-aware parallel trajectory decoding subsection) is initiated. It leverages the target’s mode-specific feature representation, incorporates social context derived from neighbor features Hj via an attention mechanism, and then utilizes a regression head (REG Head) to generate the complete future trajectory Y(i,m) in a single step, outputting K trajectory candidates $\{Y^{(i,m)}\}_{m=1}^{K}$.
The target trajectory Xi and neighbor trajectories are embedded and processed by a local–global collaborative encoder, consisting of a Causal Temporal Encoder (CTE) with SCT-MSA for local dynamics and a Global Context Encoder (GCE) for global motion trends. A motion-pattern library is scored by the CLS head to select top-K modes, and a socially-aware non-autoregressive decoder (REG head) generates K future trajectory hypotheses in parallel.
3.2 Pedestrian history encoding: A local-global collaborative approach
Accurately predicting future pedestrian trajectories hinges on effectively understanding their historical motion. Pedestrian movement is not random; it often comprises a blend of immediate, fine-grained maneuvers and broader, macro-level intentions. To capture this inherent duality, we introduce a local-global collaborative encoder. This encoder processes the observed trajectory $X_i = \{p_i^t\}_{t=1}^{T_{obs}}$, where $p_i^t \in \mathbb{R}^2$ represents the 2D coordinates of pedestrian i at time t, and Tobs is the length of the observation period.
The encoding process starts by considering each of the Nlib pre-defined motion patterns from the library (see the Structured multi-mode hypothesis generation via motion pattern library subsection). For each pattern Mk, a complete candidate sequence Sk is formed by concatenating the observed history Xi with the pattern coordinates Mk, spanning the combined observation and prediction horizon (Tobs + Tpred time steps). These Nlib candidate sequences are then processed in parallel through two distinct initial embedding pathways, corresponding to the CTE and GCE branches, as illustrated in Fig 2.

For the CTE path, designed to capture local dynamics, each candidate sequence Sk is processed by a Temporal Embedding module. This module transforms each 2D coordinate pt within Sk into a dmodel-dimensional feature vector $e_t^{(k)}$:

$$e_t^{(k)} = \phi_{temp}(p_t), \quad e_t^{(k)} \in \mathbb{R}^{d_{model}}$$

This yields Nlib initial embedded sequences $E^{(k)} = \{e_t^{(k)}\}_{t=1}^{T_{obs}+T_{pred}}$, which form the input tensor for the CTE branch.

Concurrently, for the GCE path aiming to capture global context, each candidate sequence Sk is processed by a Sequence Embedding module. This module takes the entire sequence Sk as input and generates a single dmodel-dimensional feature vector representing the overall context of that specific mode hypothesis:

$$g^{(k)} = \phi_{seq}(S_k), \quad g^{(k)} \in \mathbb{R}^{d_{model}}$$

The collection of these Nlib vectors, forming a tensor $G \in \mathbb{R}^{N_{lib} \times d_{model}}$, serves as the input representation for the GCE branch. These distinct embedding strategies ensure that the subsequent CTE and GCE layers receive input features tailored to their respective tasks of local and global pattern extraction.
3.2.1 Causal temporal encoder (CTE) for efficient local dynamic extraction.
The first branch, our CTE, is designed to meticulously capture the fine-grained temporal dynamics from the recent history of a pedestrian’s movement. The cornerstone of the CTE is the SCT-MSA module, illustrated in Fig 3. The design of SCT-MSA intrinsically respects the natural arrow of time in motion by being causal; that is, the feature representation at any time step t is influenced exclusively by past and present information ($t' \le t$). Furthermore, SCT-MSA introduces sparsity by confining its attention mechanism to a defined local historical window of size Rwindow. For each time step t, it considers information only from $t' \in [t - R_{window},\, t]$. This focus on localized, recent history is crucial for capturing immediate motion cues. The combination of causality and local sparse attention significantly enhances computational efficiency, reducing the self-attention complexity from $O(T^2 \cdot d_{model})$ per layer, typical of standard Transformers [17], to a more favorable $O(T \cdot R_{window} \cdot d_{model})$. This makes the CTE well-suited for processing observation sequences where local dependencies are crucial, particularly with large Tobs and small Rwindow.
The mechanism restricts attention to a local sliding window of size Rwindow (shaded grey area), ensuring that the feature representation at time t depends only on the recent history [t − Rwindow, t]. This design enforces causality and reduces computational complexity compared to full self-attention.
Within each SCT-MSA layer, the input sequence (e.g., the temporal embeddings for the initial layer) is transformed into Query (Q), Key (K), and Value (V) vectors via distinct linear projections ($W_Q$, $W_K$, $W_V$):

$$Q = h_{in} W_Q, \quad K = h_{in} W_K, \quad V = h_{in} W_V$$

where hin denotes the feature input to the current layer. Attention scores are then computed using scaled dot-product attention, where a mask, $M$, rigorously enforces both causality and the sparse local window by permitting attention only to positions $t'$ within the allowed range:

$$A_{t,t'} = \frac{Q_t K_{t'}^{\top}}{\sqrt{d_k}} + M_{t,t'}, \quad M_{t,t'} = \begin{cases} 0 & \text{if } t - R_{window} \le t' \le t \\ -\infty & \text{otherwise} \end{cases}$$

Here, $d_k = d_{model} / N_h$ is the dimension per attention head (Nh heads total). After Softmax normalization, the output feature $o_t$ is a weighted sum of Value vectors from the defined causal sparse window:

$$o_t = \sum_{t' = t - R_{window}}^{t} \mathrm{Softmax}(A_t)_{t'} \, V_{t'}$$

The SCT-MSA module is then completed with residual connections, Layer Normalization, and a position-wise Feed-Forward Network (FFN), adhering to the standard Transformer block structure [17]. Stacking LCTE such layers empowers the CTE branch to produce $H_{CTE}$, a feature sequence rich in detailed, short-term motion characteristics.
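As an illustration, the sparse causal masking can be sketched with numpy. This is a single-head sketch with hypothetical names; the actual module uses learned multi-head projections, residual connections, and Layer Normalization.

```python
import numpy as np

def sct_mask(T, R):
    """Sparse causal mask: position t may attend only to t' in [t-R, t].
    Allowed entries are 0; disallowed entries are -inf (added pre-softmax)."""
    t = np.arange(T)[:, None]
    tp = np.arange(T)[None, :]
    allowed = (tp <= t) & (tp >= t - R)
    return np.where(allowed, 0.0, -np.inf)

def sct_attention(Q, K, V, R):
    """Single-head scaled dot-product attention with the sparse causal mask."""
    T, dk = Q.shape
    scores = Q @ K.T / np.sqrt(dk) + sct_mask(T, R)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)   # softmax; masked entries become 0
    return w @ V
```

Because each row of the mask admits at most R + 1 positions, a banded implementation of these scores attains the O(T · Rwindow · dmodel) cost stated above (the dense sketch here is for clarity only).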
3.2.2 Global context encoder (GCE) for macro-level patterns.
The second branch of our encoder, the GCE, focuses on discerning broader, macro-level patterns inherent to the target pedestrian’s own movement. It is crucial to clarify that “Global” in this context refers to the temporal global scope of the trajectory sequence, rather than the spatial global scope of the crowd.
Unlike the Social Decoder which handles agent-agent interactions, the GCE is strictly an intra-agent module. It processes the entire candidate sequence (comprising the observed history and a hypothesized future motion mode) as a single input. This holistic view allows the GCE to extract overarching behavioral trends specific to the target’s individual motion intent, independent of social interactions. By isolating the individual’s long-term goal from transient social perturbations, the GCE provides a stable representation of long-term intent.
The GCE consists of LGCE Transformer encoder layers [17], distinguished by its use of Cosine Similarity Attention. Query (Qt) and Key (Kt′) vectors are generated as in the CTE. Their similarity, however, is measured using cosine similarity, which emphasizes their directional relationship rather than dot-product magnitude, potentially offering a better grasp of overall motion intent:

$$\mathrm{sim}(Q_t, K_{t'}) = \frac{Q_t \cdot K_{t'}}{\lVert Q_t \rVert \, \lVert K_{t'} \rVert}$$

Attention scores are derived by scaling this similarity with a learnable parameter $\tau$ (an optional mask is typically unused to allow full global interaction):

$$A_{t,t'} = \tau \cdot \mathrm{sim}(Q_t, K_{t'})$$

Following Softmax normalization:

$$\alpha_{t,t'} = \frac{\exp(A_{t,t'})}{\sum_{t''} \exp(A_{t,t''})}$$

The output feature $o_t$ aggregates information from all historical time steps:

$$o_t = \sum_{t'} \alpha_{t,t'} \, V_{t'}$$

Stacking LGCE such modified Transformer layers, each incorporating standard Layer Normalization and an FFN, yields $H_{GCE}$, a sequence encoding the global contextual information of the trajectory.
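A minimal single-head sketch of the cosine-similarity attention follows. In the model the scaling parameter tau is learnable; here it is a fixed illustrative constant, and the helper name is ours.

```python
import numpy as np

def cosine_attention(Q, K, V, tau=10.0):
    """Attention whose logits are tau * cosine(Q_t, K_t'); no mask is applied,
    so every position attends to the full sequence (global context)."""
    Qn = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    Kn = K / np.linalg.norm(K, axis=-1, keepdims=True)
    logits = tau * (Qn @ Kn.T)           # bounded in [-tau, tau]
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V
```

Normalizing Q and K bounds the logits, so the learnable tau controls attention sharpness explicitly instead of leaving it to unbounded dot-product magnitudes.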
3.2.3 Feature fusion for comprehensive understanding and computational considerations.
Having extracted detailed local dynamics ($H_{CTE}$) with the CTE and broad global patterns ($H_{GCE}$) with the GCE, these complementary representations are integrated to form a holistic understanding of the pedestrian’s historical movement. This is achieved through an element-wise summation:

$$H_i = H_{CTE} + H_{GCE}$$

This fused representation $H_i$ serves as the comprehensive historical representation for subsequent model stages.

This dual-branch architecture reflects a deliberate design choice regarding computational complexity. The CTE, with its SCT-MSA module, reduces per-layer attention complexity to approximately $O(T \cdot R_{window} \cdot d_{model})$ from the standard $O(T^2 \cdot d_{model})$. Its overall complexity (including FFNs of hidden dimension dff) is roughly $O(L_{CTE} \cdot T \cdot (R_{window} d_{model} + d_{model} d_{ff}))$. This renders the CTE highly efficient for long sequences where $R_{window} \ll T$. Conversely, the GCE maintains $O(T^2 \cdot d_{model})$ per-layer attention complexity to capture all-pairs global context, leading to an overall GCE complexity of approximately $O(L_{GCE} \cdot T \cdot (T d_{model} + d_{model} d_{ff}))$. The fusion step adds negligible $O(T \cdot d_{model})$ complexity. This design allows our model to balance computational load: the CTE efficiently distills local causal dynamics with complexity linear in Tobs (for fixed Rwindow), while the GCE, though more intensive, extracts indispensable global context, achieving a synergistic blend of expressive power and efficiency.
3.3 Structured multi-mode hypothesis generation via motion pattern library
Pedestrian behavior is inherently multi-mode, meaning individuals often have several plausible future paths. To effectively address this diversity while avoiding the inefficiencies and complex post-processing associated with some traditional generative models, we introduce a structured hypothesis generation approach. This method is centered on a pre-constructed motion pattern library, a collection denoted as $\mathcal{M} = \{M_k\}_{k=1}^{N_{lib}}$. This library comprises Nlib distinct, typical future motion patterns, each Mk representing a sequence of 2D coordinates over the prediction horizon Tpred.

The creation of this motion pattern library is an offline process performed once using the training dataset. Initially, a large corpus of future trajectory segments, each spanning Tpred time steps, is collected. To ensure that the learned patterns represent general motion characteristics rather than absolute starting positions, each trajectory segment is normalized, for instance, by translating its initial point to the origin. Following normalization, K-Means clustering, a standard unsupervised learning algorithm, is applied to these trajectory segments. K-Means groups similar trajectories together, and the centroid of each resulting cluster becomes a distinct motion pattern Mk in our library $\mathcal{M}$. The number of patterns, Nlib, is a hyperparameter chosen based on dataset characteristics and desired granularity. Each raw 2D coordinate sequence Mk is then transformed into a learnable dmodel-dimensional feature vector, $E_k \in \mathbb{R}^{d_{model}}$, using an embedding layer denoted $\phi_{lib}$. This allows the model to work with richer representations of these patterns.
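The offline library construction can be sketched as follows, using a minimal hand-rolled K-Means in numpy. In practice a standard clustering implementation would be used; the function name, iteration count, and initialization are illustrative.

```python
import numpy as np

def build_pattern_library(segments, n_lib, iters=50, seed=0):
    """Cluster normalized future segments of shape (N, T_pred, 2) with a
    minimal K-Means; the cluster centroids become the motion patterns M_k."""
    N, T, _ = segments.shape
    segs = segments - segments[:, :1, :]          # translate start to origin
    X = segs.reshape(N, T * 2)                    # flatten for clustering
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(N, n_lib, replace=False)]
    for _ in range(iters):
        # assign each segment to its nearest centroid, then recompute means
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        for k in range(n_lib):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids.reshape(n_lib, T, 2)         # the library patterns M_k
```

Because every normalized segment starts at the origin, each centroid does too, so the patterns encode motion shape and direction rather than absolute position.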
During inference, the model utilizes the fused feature tensor $H$ obtained after the local-global feature fusion step (see the Feature fusion for comprehensive understanding and computational considerations subsubsection). This tensor, with dimensions reflecting the batch size, the Nlib mode hypotheses, and the feature dimension (B × Nlib × dmodel), encapsulates the comprehensive representation for each potential future scenario. This entire tensor $H$ is then directly fed into the scoring network, MLPscore (referred to as the CLS Head in Fig 2). The scoring network, implemented as a linear layer mapping from dmodel to 1, operates independently on the feature vector corresponding to each of the Nlib mode hypotheses:

$$\mathrm{Scores}_i = \mathrm{MLP}_{score}(H)$$

where Scoresi is a tensor of shape B × Nlib, containing the calculated score for each of the Nlib patterns based on their respective fused representations.

These scores are converted into a probability distribution pi over the patterns via the Softmax function, where p(i,k) indicates the likelihood of pattern Mk for pedestrian i:

$$p_{(i,k)} = \frac{\exp(\mathrm{Scores}_{(i,k)})}{\sum_{k'=1}^{N_{lib}} \exp(\mathrm{Scores}_{(i,k')})}$$

For training, target modes and probabilities guide the learning of Lpred and Lmode (detailed in the Training strategy and loss functions subsection). In inference, the top-K patterns, identified by $\mathrm{TopK}(p_i)$, and their embeddings $\{E_m\}$ direct the parallel generation of multiple trajectory hypotheses.
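The CLS-head scoring and top-K selection reduce to a linear map plus Softmax, sketched here with numpy for a single pedestrian (weight and function names are illustrative):

```python
import numpy as np

def score_and_select(H, w, b, k):
    """H: (N_lib, d_model) fused features, one per mode hypothesis.
    A linear CLS head maps each to a scalar score; Softmax yields the
    mode distribution p_i, and the top-k indices select the patterns."""
    scores = H @ w + b                        # (N_lib,) one score per mode
    e = np.exp(scores - scores.max())         # numerically stable softmax
    p = e / e.sum()
    topk = np.argsort(p)[::-1][:k]            # indices of the k likeliest modes
    return p, topk
```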
This library-based mechanism enhances prediction quality and interpretability by incorporating explicit prior knowledge of common behaviors, guiding generation towards plausible outcomes. While the library’s coverage limits its ability to represent entirely novel behaviors and discretizing motion might lose some nuances, its careful construction is expected to significantly improve the generation of diverse and realistic trajectory candidates. The quality and representativeness of the library are key considerations.
3.4 Socially-aware parallel trajectory decoding
Effective trajectory prediction requires not only understanding an individual’s past movement and intended goals but also their dynamic interactions with surrounding individuals. The Socially-Aware Parallel Trajectory Decoding approach proposed in this paper addresses this by simultaneously generating multiple future path hypotheses for a target pedestrian, each explicitly conditioned on the dynamic social context. This capability is vital for creating realistic and reliable predictions, particularly in scenarios with complex pedestrian interplay where movements are heavily interdependent.
To incorporate these social influences efficiently, the decoder utilizes the neighbor set $\mathcal{N}_i$ identified via the spatial distance threshold $d_{th}$ (as defined in Section 3.1). This design is critical for scalability. Let P denote the total number of pedestrians in the scene and Nmax be the maximum number of neighbors considered (set to 50 in our experiments). While global attention mechanisms inherently suffer from quadratic complexity $O(P^2)$, our distance-based filtering reduces the interaction scope to a local subset. Consequently, the computational cost for the social module scales linearly, $O(P \cdot N_{max})$. This ensures that the model remains lightweight and responsive even in high-density crowds. The historical trajectory Xj of each neighbor $j \in \mathcal{N}_i$ is processed to obtain its summarized context vector cj. In our proposed LGCMT model, this is achieved by first flattening its historical trajectory Xj into a single vector, which is then processed through a dedicated linear embedding layer. This approach provides computationally efficient contextual representations cj for each neighboring agent, suitable for the subsequent social attention module.
For each of the K motion patterns selected from the library (as described in the Structured multi-mode hypothesis generation via motion pattern library subsection), the decoding process generates a corresponding future trajectory Y(i,m). This generation process is non-autoregressive, predicting all Tpred future coordinates simultaneously for enhanced inference speed. The process begins with a feature representation for the target pedestrian i that is specific to the selected mode m. This feature, denoted $h_i^{(m)}$, is derived from the target’s fused historical features $H_i$ and incorporates the corresponding pattern embedding $E_m$.

Crucially, social context is then integrated in a mode-specific manner using a social attention mechanism. This mechanism takes the target pedestrian’s mode-specific feature $h_i^{(m)}$ as the query, while the set of neighbor context vectors $\{c_j\}_{j \in \mathcal{N}_i}$ serve as keys and values. This allows the model to dynamically weigh the influence of each surrounding individual conditioned specifically on the motion hypothesis m being considered:

$$\tilde{h}_i^{(m)} = \mathrm{Attention}\big(h_i^{(m)},\, \{c_j\},\, \{c_j\}\big)$$

This yields a socially-informed, mode-specific feature vector, $\tilde{h}_i^{(m)}$, which encapsulates the target’s history, the specific motion pattern’s influence, and relevant social interactions pertinent to that mode.

This resultant feature is then fed directly into a shared regression network, MLPreg. This network outputs the complete Tpred-step future trajectory Y(i,m) for the corresponding mode m:

$$Y^{(i,m)} = \mathrm{MLP}_{reg}\big(\tilde{h}_i^{(m)}\big)$$
This entire procedure, from preparing the mode-specific query to final regression, is executed in parallel for each of the K selected modes, efficiently producing the diverse set of trajectory candidates.
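A compressed numpy sketch of one mode's decoding pass follows, combining the social cross-attention with the single-shot regression head. Single-head attention, a linear regression head, and an additive fusion of the social context are simplifications; all names are ours.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def nar_decode(h_mode, C, W_reg, b_reg, T_pred):
    """h_mode: (d,) mode-specific target feature (history + pattern embedding).
    C: (N_neighbors, d) neighbor context vectors, used as keys and values.
    One cross-attention step, then a regression head emits all T_pred
    future coordinates in a single forward pass (non-autoregressive)."""
    d = h_mode.shape[0]
    attn = softmax(C @ h_mode / np.sqrt(d))   # weight each neighbor
    social = attn @ C                         # (d,) aggregated social context
    h_tilde = h_mode + social                 # socially-informed feature
    out = h_tilde @ W_reg + b_reg             # (T_pred * 2,) all steps at once
    return out.reshape(T_pred, 2)
```

Because the regression head emits the whole horizon in one matrix multiplication, running this for the K selected modes is K independent, parallelizable passes rather than Tpred sequential ones.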
This parallel and non-autoregressive design efficiently produces a diverse set of trajectory candidates. By bypassing the iterative recurrence of traditional LSTM-based decoders (which require Tpred sequential steps), our regression head generates the full prediction horizon in a single forward pass ($O(1)$ temporal complexity). Combined with the linear spatial complexity of the social attention, this architecture achieves a significant reduction in inference latency. The explicit modeling of social interactions makes this decoding strategy particularly effective in complex real-world environments, while the optimized computational design ensures adaptability across varying crowd densities.
Algorithm 1: LGCMT Prediction Process
3.5 Training strategy and loss functions
The model is trained by minimizing a composite loss function, Ltotal, designed to ensure both prediction accuracy and appropriate mode selection. This total loss is a weighted sum of a trajectory prediction loss Lpred and a mode classification loss Lmode, balanced by hyperparameters $\lambda_{pred}$ and $\lambda_{mode}$:

$$L_{total} = \lambda_{pred} L_{pred} + \lambda_{mode} L_{mode}$$
The trajectory prediction loss, Lpred, is formulated to guide the model towards accurate trajectory generation conditioned on the most suitable motion pattern during the training phase. Specifically, for each ground truth future trajectory $Y_i^{gt}$, we first identify the index m* of the motion pattern $M_{m^*}$ within the library $\mathcal{M}$ that exhibits the highest similarity to $Y_i^{gt}$. The model is then explicitly trained to produce a single trajectory prediction, $Y^{(i,m^*)}$, by utilizing only the computational path associated with this best-matching mode m*. The prediction loss Lpred is subsequently computed as the Smooth L1 loss between this specific prediction $Y^{(i,m^*)}$ and the ground truth $Y_i^{gt}$:

$$L_{pred} = \ell_{SmoothL1}\big(Y^{(i,m^*)},\, Y_i^{gt}\big)$$

where $\ell_{SmoothL1}$ represents the Smooth L1 loss function, typically averaged over all predicted time steps and samples. This training approach ensures that the learning signal focuses on generating accurate predictions from the identified ‘target’ mode, differing fundamentally from the inference-time evaluation procedure where the best among keval generated hypotheses is selected based on distance metrics.
The mode classification loss, Lmode, guides the model to identify appropriate underlying motion patterns from the library. A target “soft” probability distribution qi over the Nlib library patterns is first derived by comparing the normalized ground truth future Yi to each normalized library pattern Mk. Similarity scores s(i,k) are converted to probabilities q(i,k) using a Softmax function with temperature τ:

q(i,k) = exp(s(i,k)/τ) / Σk′ exp(s(i,k′)/τ)
Lmode is then the cross-entropy between the model’s predicted mode distribution pi (shown in Eq 13) and this target qi:

Lmode = −Σk q(i,k) log p(i,k)
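A minimal sketch of the soft-target construction and cross-entropy, using plain lists; the temperature value here is illustrative, not the paper's tuned setting.

```python
import math

def soft_mode_targets(similarities, tau=0.1):
    """Convert similarity scores s(i, k) into a soft target distribution
    q(i, k) via a temperature-tau Softmax."""
    exps = [math.exp(s / tau) for s in similarities]
    z = sum(exps)
    return [e / z for e in exps]

def mode_cross_entropy(p, q, eps=1e-12):
    """L_mode: cross-entropy -sum_k q(i, k) * log p(i, k) between the
    predicted mode distribution p and the soft target q."""
    return -sum(qk * math.log(pk + eps) for pk, qk in zip(p, q))
```

A lower temperature sharpens q toward the single best-matching pattern; a higher one spreads credit across several plausible modes.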
Jointly optimizing Lpred and Lmode promotes high-quality, diverse, and contextually appropriate predictions. Algorithm 1 summarizes the complete forward pass of our proposed model.
4 Results
4.1 Experimental setup
4.1.1 Datasets.
We evaluate the LGCMT model on the widely used ETH [25] and UCY [26] benchmarks. These benchmarks comprise five distinct scenes, namely ETH, HOTEL, UNIV, ZARA1, and ZARA2, featuring varied pedestrian densities and interaction patterns. The data consists of 2D coordinates recorded at 2.5 Hz. We adhere to standard evaluation protocols [8,10], observing trajectories for 8 time steps (corresponding to 3.2 seconds, denoted Tobs) and predicting the subsequent 12 steps (covering 4.8 seconds, denoted Tpred). A Leave-One-Out Cross-Validation (LOOCV) strategy is employed for evaluation, where the model is trained on four scenes and tested on the remaining fifth scene, iterating this process so that each scene serves as the test set once. Input trajectory observations undergo normalization: the starting point is translated to the origin, and the trajectory is rotated to align its initial motion direction approximately with the X-axis.
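The normalization step can be illustrated as follows; this is a sketch under the assumption that the initial motion direction is taken from the first observed displacement, which may differ from the exact convention used in the implementation.

```python
import math

def normalize_trajectory(traj):
    """Translate a trajectory so it starts at the origin, then rotate it
    so the initial motion direction lies along the +X axis.

    traj: list of (x, y) coordinates, oldest first."""
    x0, y0 = traj[0]
    shifted = [(x - x0, y - y0) for x, y in traj]
    dx, dy = shifted[1]                 # first motion step defines the heading
    theta = math.atan2(dy, dx)
    c, s = math.cos(-theta), math.sin(-theta)
    # rotate every point by -theta about the origin
    return [(x * c - y * s, x * s + y * c) for x, y in shifted]
```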
In addition, we evaluate our model on the Stanford Drone Dataset (SDD) [27], a large-scale benchmark with higher crowd density and scene complexity. For fair comparison, we adopt the same Tobs = 8 and Tpred = 12 settings.
4.1.2 Evaluation metrics.
Model performance is quantified using the Average Displacement Error (ADE) and Final Displacement Error (FDE). ADE measures the mean L2 distance between the predicted trajectory Ŷ and the ground truth trajectory Y over all predicted time steps:

ADE = (1/Tpred) Σt=1..Tpred ‖Ŷt − Yt‖2

FDE calculates the L2 distance specifically at the final predicted time step Tpred:

FDE = ‖ŶTpred − YTpred‖2
To account for the inherent multi-modality of future trajectories, we follow common practice [5] by generating K potential trajectories and reporting the minimum ADE and FDE among these candidates, denoted minADEK and minFDEK, averaged across all test samples. Unless otherwise specified, we use K = 20. Lower ADE and FDE values indicate better prediction accuracy.
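The two metrics and the best-of-K protocol reduce to a few lines; the following is a self-contained sketch over 2D point lists.

```python
import math

def ade(pred, gt):
    """Average L2 displacement over all predicted time steps."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def fde(pred, gt):
    """L2 displacement at the final predicted time step."""
    return math.dist(pred[-1], gt[-1])

def min_ade_fde(candidates, gt):
    """Best-of-K evaluation: the minimum ADE and FDE over K candidates."""
    return min(ade(c, gt) for c in candidates), min(fde(c, gt) for c in candidates)
```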
4.1.3 Hyperparameter settings.
The proposed LGCMT model was implemented using the PyTorch framework, and all experiments were conducted on an NVIDIA RTX 4070 GPU. The core motion mode library is constructed offline via K-Means clustering. The size of the motion library Nlib was optimized per scene: 50 for ETH, 90 for HOTEL, 50 for UNIV, 70 for ZARA1, and 50 for ZARA2, while Nlib = 100 was used for the SDD. During inference, we select the top K = 20 modes, consistent with the evaluation protocol. Regarding the neighbor selection strategy defined in Section 3.1, we set the spatial distance threshold to 2.0 meters for the ETH/UCY datasets and 5.0 units for the SDD to capture socially significant interactions within the respective spatial scales.
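The threshold-based neighbor selection can be sketched as below; the helper name and data layout are ours for illustration, not from the released code.

```python
import math

def select_neighbors(ego_pos, agent_positions, threshold=2.0):
    """Keep only agents within `threshold` of the ego pedestrian
    (2.0 m for ETH/UCY, 5.0 units for SDD, per the settings above).

    agent_positions: dict agent_id -> (x, y) at the current time step."""
    return [aid for aid, pos in agent_positions.items()
            if math.dist(ego_pos, pos) <= threshold]
```

Restricting attention to this filtered set keeps the social-interaction cost proportional to the number of nearby agents rather than the full scene population.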
Regarding model architecture, the core hidden dimension dmodel was set to 256 for the ETH and HOTEL datasets, 128 for the UNIV, ZARA1, and ZARA2 datasets, and 64 for the SDD. The local-global collaborative encoder employs 3 stacked Transformer blocks for the ETH dataset and 2 blocks for all other datasets (including SDD). All multi-head attention mechanisms within the encoders and decoders utilize 4 attention heads. The local history window size Rwindow for the SCT-MSA module was also tuned specifically for each scene: 4 for ETH and HOTEL, 7 for UNIV and ZARA1, 5 for ZARA2, and 3 for the SDD. For training, we used a batch size of 128 and adhered to the optimization strategy detailed in the Training strategy and loss functions subsection.
4.2 Comparison with existing methods
To rigorously evaluate the proposed LGCMT framework, we compare it against a comprehensive set of contemporary methods detailed in Table 1. These include Social-STGCNN [16], STAR [6], PECNet [28], AgentFormer [5], SGCN [15], SIT [29], MemoNet [30], SocialVAE [21], TUTR [24], BCDiff [7], FlowChain [23], SMEMO [4], Social NSTransformers [32], TP-EGT [18], TPPO [31], Social Entropy Informer [33], Social Informer [34], and W-DGTrans [35].
4.2.1 Performance on ETH/UCY datasets.
On the standard ETH and UCY benchmarks, LGCMT demonstrates robust performance, achieving the lowest average errors across all compared methods with a minADE of 0.20 and a minFDE of 0.34. This represents a substantial improvement over earlier baselines and a competitive edge over the most recent approaches.
Specifically, we compare our method against the recent works of TP-EGT [18] and TPPO [31] as highlighted in recent literature. LGCMT outperforms the graph-transformer-based TP-EGT (Average ADE 0.23) by approximately 13.0% and significantly surpasses the pseudo-oracle-based TPPO (Average ADE 0.39) with a reduction in error of roughly 48.7%. Furthermore, compared to the 2026 baseline W-DGTrans [35], which reports an average ADE of 0.21 meters, our model maintains a performance advantage, particularly in the HOTEL and ZARA2 scenes where distinct motion patterns and social interactions are prevalent.
The breakdown by scene reveals that LGCMT achieves the best reported ADE in the HOTEL (0.11), UNIV (0.23), and ZARA2 (0.14) subsets. This consistency across datasets with varying pedestrian densities validates the effectiveness of the local-global collaborative encoder. By simultaneously capturing the fine-grained local dynamics and the long-term individual intent, the model effectively mitigates the trade-off often observed in other methods that may overfit to specific scene types.
4.2.2 Performance on SDD dataset.
To address the limitations associated with the relatively small scale of the ETH/UCY datasets and to test the model’s scalability in high-density, real-world environments, we extended our evaluation to the SDD [27]. As shown in Table 2, SDD presents significantly greater challenges due to its bird’s-eye view, diverse agent types (including cyclists and skaters), and complex static obstacles.
In this rigorous benchmark, LGCMT achieves a minADE of 7.90 pixels and a minFDE of 13.04 pixels. These results surpass those of competitive baselines, including TUTR [24] (7.99 pixels ADE) and SMEMO [4] (8.11 pixels ADE). The superior performance on SDD is particularly noteworthy as it confirms that the proposed spatial neighbor filtering strategy allows the model to scale efficiently to crowded scenes. Unlike global attention mechanisms that may suffer from noise accumulation when processing dozens of agents, our method maintains precision by focusing on socially relevant neighbors, thereby demonstrating strong robustness and generalization capabilities in complex, unstructured environments.
4.3 Ablation study
To rigorously validate the architectural choices underpinning the LGCMT model and quantify their individual contributions, we conducted comprehensive ablation studies on the ETH/UCY datasets. These experiments assessed the impact of removing or altering key components on both predictive accuracy, measured by average minimum ADE and FDE over 20 samples in meters, and computational efficiency via average inference time in milliseconds. All ablation experiments maintained the primary experimental setup on an NVIDIA RTX 4070 GPU.
4.3.1 Component effectiveness analysis.
We first investigated the contribution of LGCMT’s primary architectural modules by systematically removing or modifying them. The results, detailed in Table 3, reveal the significance of each design decision.
The necessity of the local-global collaborative encoding strategy is immediately apparent. Removing the global context encoder, hereafter referred to as GCE, led to a substantial performance decline; average ADE increased by 20.0% to 0.24 and average FDE rose by 20.6% to 0.41. This underscores the critical role of the GCE in capturing broader trajectory trends and inferring longer-term pedestrian intent. Similarly, ablating the causal temporal encoder, hereafter CTE, also hampered predictive accuracy, elevating average ADE by 10.0% to 0.22 and FDE by 11.8% to 0.38. While this impact was less severe than removing the global context, it confirms the importance of the CTE’s fine-grained local motion modeling. These findings collectively indicate that relying solely on a single temporal scale is insufficient; LGCMT’s strength derives from synergistically integrating local dynamics captured by the CTE with global motion understanding provided by the GCE.
The experiments further highlight the profound impact of the prediction guidance mechanism. The most significant performance degradation occurred when the motion mode library was removed. Without this guidance, average ADE surged by 60.0% to 0.32, and FDE increased dramatically by 70.6% to 0.58. This starkly illustrates that generating structured multi-mode hypotheses, informed by learned motion patterns, is fundamental to the model’s accuracy and its capacity for diverse, plausible predictions. Additionally, neglecting social interactions proved detrimental. Removing social context modeling resulted in a 40.0% increase in average ADE to 0.28 and a 47.1% increase in average FDE to 0.50, confirming that accounting for the influence of nearby pedestrians remains crucial for realistic trajectory forecasting, particularly in interactive settings.
Finally, the specific choice of attention mechanisms within the encoders was validated. Replacing the specialized Sparse Causal Temporal Multi-head Self-Attention, known as SCT-MSA, in the CTE with standard self-attention resulted in a noticeable drop in performance, yielding an average ADE/FDE of 0.21/0.37. A similar decline to 0.22/0.37 was observed when the GCE’s cosine similarity attention was substituted with standard dot-product attention. These outcomes affirm that the tailored designs of SCT-MSA for local temporal dependencies and cosine similarity attention for global directional trends are more effective within their respective LGCMT modules than generic attention approaches.
4.3.2 Complexity and inference speed analysis.
Beyond predictive accuracy, the practical utility of a forecasting model relies heavily on its computational efficiency. To comprehensively evaluate the real-time performance of the algorithm, we analyzed three key indicators: model parameters (Params), computational complexity (FLOPs), and actual Inference Time.
To ensure a fair and rigorous comparison, we established a unified evaluation protocol. All models listed in Table 4 were re-evaluated on a workstation equipped with an NVIDIA GeForce RTX 4070 GPU. We strictly followed the standard real-time latency evaluation criteria: the inference time was measured with a batch size of 1 and a sampling number of K = 20. Furthermore, all reported data are the average of measurements taken after a warm-up period to exclude system initialization noise and random fluctuations. The values are averaged across the five ETH/UCY datasets to mitigate biases from specific scene characteristics.
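The warm-up-then-average protocol can be sketched as follows. This is a CPU-side illustration only: `model_fn` is a placeholder for the forward pass, and on a GPU a device synchronization before each clock read would also be required.

```python
import time

def measure_latency(model_fn, inputs, warmup=10, runs=100):
    """Average single-sample inference latency in milliseconds,
    measured after a warm-up phase (batch size 1, averaged runs)."""
    for _ in range(warmup):           # exclude initialization noise
        model_fn(inputs)
    t0 = time.perf_counter()
    for _ in range(runs):
        model_fn(inputs)
    return (time.perf_counter() - t0) / runs * 1000.0
```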
It is worth noting that the inference time for the same algorithm varies across different scenarios. This variance is primarily attributed to crowd density. Most interaction-aware models, including LGCMT, employ mechanisms where the computational cost scales with the number of agents in the frame. Consequently, densely populated scenarios (such as UNIV) naturally incur slightly higher latency compared to sparse scenarios (such as ETH).
As shown in Table 4, LGCMT achieves an optimal balance between accuracy and efficiency (2.79 ms). When compared to complex Transformer-based architectures, our model demonstrates a significant speed advantage. For instance, MemoNet (156.96 ms) involves processing continuous features alongside high-overhead retrieval operations from an external memory bank, which inherently limits its inference speed. Similarly, AgentFormer (46.58 ms) requires heavy computation for its dense attention mechanisms. Regarding SocialVAE, while it achieves competitive speed (16.27 ms), this metric is recorded without its “Final Position Clustering (FPC)” post-processing. Although FPC can improve precision, it increases latency to approximately 2.8 seconds per scene, rendering it unsuitable for real-time applications. While lightweight baselines like Social-STGCNN (1.59 ms) and TUTR (1.89 ms) are marginally faster, LGCMT maintains a comparable millisecond-level speed while offering more robust trajectory modeling capabilities. This analysis confirms that LGCMT is well-suited for deployment in dynamic environments where both accuracy and low latency are critical.
Impact of Hidden Dimensions: To further explore the scalability and the trade-off between model capacity and efficiency, we conducted an ablation study on the SDD dataset by varying the hidden dimension size (H). As detailed in Table 5, increasing the hidden dimension from 16 to 256 leads to a substantial increase in computational cost: parameters grow from 0.08 M to 2.47 M, and FLOPs double from 0.13 G to 0.26 G. However, this increase in complexity does not strictly correlate with performance gains. The model achieves its best predictive accuracy (ADE = 7.90) at H = 64, with an inference time of just 3.02 ms. Interestingly, larger dimensions result in slight performance degradation (ADE = 8.12), likely due to overfitting on the trajectory data. Conversely, an extremely small dimension (H = 16) limits the model’s representational power, leading to higher errors. Consequently, we selected H = 64 as the optimal configuration for our main experiments, as it minimizes computational overhead while maximizing prediction accuracy.
4.3.3 Robustness analysis.
To verify the stability and reproducibility of LGCMT, we conducted repeated experiments using 5 different random seeds on both the ETH/UCY and SDD datasets. Table 6 reports the statistics in the format of Mean ± Standard Deviation (Std). While the main results in Table 1 report the best-performing model to ensure a fair comparison with baselines, the results here demonstrate that the deviation between the mean performance and the best run is minimal. For instance, on the challenging ETH scene, the mean ADE is 0.37m compared to the best run of 0.36m. The extremely low standard deviations (e.g., ±0.001 on HOTEL and UNIV) confirm that LGCMT is robust to initialization and achieves consistent performance.
4.3.4 Impact of non-autoregressive decoding.
To isolate the contribution of our chosen decoding strategy, we explicitly compared the standard non-autoregressive (NAR) LGCMT against an autoregressive (AR) counterpart. This baseline, termed LGCMT (AR), utilized the identical encoder architecture but employed a traditional sequential decoder. Table 7 presents the comparison regarding both predictive accuracy, ADE/FDE, and inference time.
The advantages of the NAR approach are clear. Regarding predictive accuracy, the NAR model significantly outperformed its AR variant. Average ADE decreased by 25.9% from 0.27 to 0.20, and average FDE saw a 17.1% reduction from 0.41 to 0.34. This accuracy enhancement likely arises from the NAR mechanism generating the full sequence concurrently, mitigating the error accumulation often problematic in step-by-step autoregressive predictions.
The efficiency benefits are even more striking. The NAR-based LGCMT required only 2.79 milliseconds for inference on average. This is approximately 31 times faster than the 87.55 milliseconds needed by the LGCMT (AR) model. This dramatic speed-up is a direct result of the NAR decoder’s parallel computation across all future time steps, contrasting sharply with the inherent sequential processing of AR decoding. This finding strongly advocates for the NAR strategy in applications demanding both high accuracy and rapid response times.
4.3.5 Key hyperparameter sensitivity analysis.
We further investigated how LGCMT’s performance responds to variations in two key hyperparameters: the motion mode library size Nlib, and the local history window size Rwindow used within the SCT-MSA module. Our primary results employed scene-specific optimal values for Nlib, ranging from 50 to 90, and for Rwindow, ranging from 4 to 7, achieving the benchmark 0.20/0.34 average ADE/FDE. The following analysis explores performance when these hyperparameters are set uniformly across all scenes.
First, varying the motion mode library size Nlib uniformly across values {30, 50, 70, 90, 110} revealed performance trends depicted in Fig 4. A small library size where Nlib equals 30 yielded a noticeable drop to 0.23/0.40 average ADE/FDE, suggesting insufficient capacity to capture motion diversity. Performance stabilized around 0.21 average ADE and 0.35–0.36 average FDE for Nlib between 50 and 90. Increasing Nlib further to 110 resulted in a slight degradation to 0.22/0.37. This indicates that while a sufficiently large library is crucial, an excessively large Nlib offers diminishing returns and can slightly dilute performance, reinforcing the benefit of scene-specific tuning.
Bars report the average minADE/minFDE when a single Nlib is used for all scenes. Horizontal dashed lines denote the baseline obtained with scene-specific optimal Nlib. Performance is stable for Nlib between 50 and 90, while per-scene tuning achieves the best accuracy.
Next, evaluating the influence of the local history window size Rwindow with uniform values from {2, 3, 4, 5, 6, 7}, as shown in Fig 5, indicated greater robustness compared to variations in Nlib. Across the tested range, average ADE remained between 0.21 and 0.22, and average FDE between 0.35 and 0.37. Settings such as Rwindow = 4 or Rwindow = 5 produced a solid 0.21/0.35 average ADE/FDE. Even extreme values did not cause sharp performance drops. This suggests LGCMT is relatively insensitive to the exact local window size, although optimal scene-specific selection, as used in our main experiments, can provide marginal gains, further validating our adaptive configuration strategy.
Bars report the average minADE/minFDE when a single Rwindow is used for all scenes. Horizontal dashed lines denote the baseline obtained with scene-specific optimal Rwindow. The model is robust to Rwindow, with stable performance around Rwindow = 4–5.
4.3.6 Visualization.
We complement our quantitative evaluation with qualitative visualizations in real-world scenarios selected from the ETH and UCY test sets.
Fig 6 showcases the diversity of predictions generated by LGCMT (K = 20). The visualization confirms that our library-guided approach can hypothesize varied yet plausible outcomes, effectively covering the ground truth. This ability to model distribution spread is crucial for capturing motion uncertainty in dynamic environments.
(A) ETH, (B) HOTEL, (C) UNIV, (D) ZARA1, and (E) ZARA2. Observed trajectories are shown in green, ground-truth futures in blue, predicted trajectories are shown as red dashed lines, and the best prediction is highlighted in solid red.
Furthermore, Fig 7 compares the best-predicted trajectory of LGCMT against the TUTR baseline [24]. In scenarios requiring complex maneuvers, such as navigating through crowds or approaching destinations, LGCMT demonstrates stronger adherence to the ground truth. Unlike the baseline, which may struggle with sudden directional changes, our model effectively captures fine-grained dynamics. Collectively, these visualizations validate that the proposed Local-Global Collaborative Encoder and NAR decoder successfully capture intricate pedestrian dynamics.
(A) ETH, (B) HOTEL, (C) UNIV, and (D) ZARA. Observations are shown in green and ground truth in blue. LGCMT is shown in red and the baseline is shown in orange.
5 Discussion
The experimental results presented offer critical insights into the mechanisms underpinning LGCMT’s performance. The model’s effectiveness is rooted in the synergistic collaboration of its architectural components, which balance historical interpretation, structured guidance, and operational efficiency.
The ablation studies confirm that accurate prediction requires disentangling multi-scale temporal dynamics. The significant performance degradation observed when removing either the Global Context Encoder or the Causal Temporal Encoder underscores that neither short-term kinematics nor long-term intent is sufficient in isolation. By integrating these through specialized attention mechanisms (SCT-MSA and Cosine Similarity), LGCMT effectively captures the duality of pedestrian motion—reacting to immediate surroundings while maintaining a consistent destination.
Furthermore, the motion mode library proves to be a cornerstone for ensuring predictive diversity. By constraining trajectory generation to a learned set of behavioral patterns, the model effectively mitigates mode collapse and unrealistic path generation. This structured guidance, coupled with explicit social interaction modeling, ensures that predictions are not only diverse but also socially compliant. The strategic choice of the Library-Guided NAR decoder is further validated by the efficiency analysis. By eliminating the sequential bottleneck of autoregressive models, LGCMT achieves a dramatic inference speedup (approximately 30×) while maintaining high accuracy, confirming its suitability for latency-critical real-world applications.
Despite its demonstrated effectiveness, LGCMT offers several clear directions for further improvement. First, the reliance on a pre-constructed motion library means generalization is tied to the diversity of the training data. Future work could explore online adaptation or dynamic mode discovery to strengthen robustness under out-of-distribution behaviors.
Second, in terms of environmental context, the current framework relies solely on trajectory data and therefore does not explicitly model map semantics or obstacle geometry. Although the model can indirectly infer walkable regions from agents’ historical behaviors, it lacks explicit physical grounding to ensure collision-free predictions when facing complex static structures in highly organized scenes. This choice was made to prioritize computational efficiency and emphasize dynamic social interactions. Nonetheless, incorporating a lightweight, scene-centric branch to process semantic maps or occupancy grids would be a natural next step. Such an extension could improve generalization in navigation-intensive environments while largely preserving the inference efficiency of our architecture.
Finally, expanding the evaluation to broader domains is a promising direction. Beyond pedestrian dynamics, complex multi-agent interactions are prevalent in sports analytics, such as the NBA dataset. While distinct in input modality and team strategies, such scenarios share the need for modeling coupled spatiotemporal behaviors. Adapting LGCMT to handle these domain-specific constraints would be a valuable step toward testing the cross-domain generalizability of our local-global and library-guided approach.
6 Conclusion
This paper presented LGCMT, an efficient pedestrian trajectory prediction framework that adeptly captures the complex, multi-modal nature of human movement. The core strength of LGCMT lies in its innovative local-global collaborative encoder, which synergistically employs sparse causal temporal attention for local dynamics and cosine similarity attention for global patterns to construct a comprehensive historical representation. To address predictive diversity, we introduced a structured hypothesis mechanism guided by a motion mode library, ensuring the generation of varied yet plausible futures. By integrating social interaction modeling with an efficient non-autoregressive parallel decoder, LGCMT not only achieves competitive accuracy on the standard ETH/UCY benchmarks but also demonstrates robust scalability on the challenging SDD. These results confirm that LGCMT offers a compelling balance of performance and efficiency, making it highly suitable for practical deployment in real-world applications.
References
- 1. Chen X, Zhang H, Deng F, Liang J, Yang J. Stochastic Non-Autoregressive Transformer-Based Multi-Modal Pedestrian Trajectory Prediction for Intelligent Vehicles. IEEE Trans Intell Transport Syst. 2024;25(5):3561–74.
- 2. Lian J, Yu F, Li L, Zhou Y. Causal Temporal–Spatial Pedestrian Trajectory Prediction With Goal Point Estimation and Contextual Interaction. IEEE Trans Intell Transport Syst. 2022;23(12):24499–509.
- 3. Yang C, Pan H, Sun W, Gao H. Social Self-Attention Generative Adversarial Networks for Human Trajectory Prediction. IEEE Trans Artif Intell. 2024;5(4):1805–15.
- 4. Marchetti F, Becattini F, Seidenari L, Bimbo AD. SMEMO: Social Memory for Trajectory Forecasting. IEEE Trans Pattern Anal Mach Intell. 2024;46(6):4410–25. pmid:38252585
- 5. Yuan Y, Weng X, Ou Y, Kitani K. AgentFormer: Agent-Aware Transformers for Socio-Temporal Multi-Agent Forecasting. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021. 9793–803. https://doi.org/10.1109/iccv48922.2021.00967
- 6. Yu C, Ma X, Ren J, Zhao H, Yi S. Spatio-Temporal Graph Transformer Networks for Pedestrian Trajectory Prediction. Lecture Notes in Computer Science. Springer International Publishing. 2020. 507–23. https://doi.org/10.1007/978-3-030-58610-2_30
- 7. Chen G, Li C, Li R, Ren D, Wang G, Yuan Y. BCDiff: Bidirectional Consistent Diffusion for Instantaneous Trajectory Prediction. In: Advances in Neural Information Processing Systems 36, 2023. 14400–13. https://doi.org/10.52202/075280-0633
- 8. Gupta A, Johnson J, Fei-Fei L, Savarese S, Alahi A. Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018. 2255–64. https://doi.org/10.1109/cvpr.2018.00240
- 9. Giuliari F, Hasan I, Cristani M, Galasso F. Transformer Networks for Trajectory Forecasting. In: 2020 25th International Conference on Pattern Recognition (ICPR), 2021. 10335–42. https://doi.org/10.1109/icpr48806.2021.9412190
- 10. Alahi A, Goel K, Ramanathan V, Robicquet A, Fei-Fei L, Savarese S. Social LSTM: Human Trajectory Prediction in Crowded Spaces. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 961–71. https://doi.org/10.1109/cvpr.2016.110
- 11. Sadeghian A, Kosaraju V, Sadeghian A, Hirose N, Rezatofighi H, Savarese S. SoPhie: An Attentive GAN for Predicting Paths Compliant to Social and Physical Constraints. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 1349–58. https://doi.org/10.1109/cvpr.2019.00144
- 12. Shafiee N, Padir T, Elhamifar E. Introvert: Human Trajectory Prediction via Conditional 3D Attention. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. 16810–20. https://doi.org/10.1109/cvpr46437.2021.01654
- 13. Dong Y, Wang L, Zhou S, Tang W, Hua G, Sun C. AFC-RNN: Adaptive Forgetting-Controlled Recurrent Neural Network for Pedestrian Trajectory Prediction. IEEE Trans Pattern Anal Mach Intell. 2025;47(11):10177–91. pmid:40742860
- 14. Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. In: Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France, 2017. 1–14. https://openreview.net/forum?id=SJU4ayYgl
- 15. Shi L, Wang L, Long C, Zhou S, Zhou M, Niu Z, et al. SGCN: Sparse Graph Convolution Network for Pedestrian Trajectory Prediction. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. 8990–9. https://doi.org/10.1109/cvpr46437.2021.00888
- 16. Mohamed A, Qian K, Elhoseiny M, Claudel C. Social-STGCNN: A Social Spatio-Temporal Graph Convolutional Neural Network for Human Trajectory Prediction. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 14412–20. https://doi.org/10.1109/cvpr42600.2020.01443
- 17. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 2017. 6000–10.
- 18. Yang B, Fan F, Ni R, Wang H, Jafaripournimchahi A, Hu H. A Multi-Task Learning Network With a Collision-Aware Graph Transformer for Traffic-Agents Trajectory Prediction. IEEE Trans Intell Transport Syst. 2024;25(7):6677–90.
- 19. Ni R, Lu S, Hu C, Yang B. Adaptive Progressive Transformer-Based Trajectory Prediction Under Fine-Grained Trajectory-Scene Interaction Constraint. IEEE Trans Automat Sci Eng. 2025;22:24498–509.
- 20. Lee N, Choi W, Vernaza P, Choy CB, Torr PHS, Chandraker M. DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 2165–74. https://doi.org/10.1109/cvpr.2017.233
- 21. Xu P, Hayet J-B, Karamouzas I. SocialVAE: Human Trajectory Prediction Using Timewise Latents. Lecture Notes in Computer Science. Springer Nature Switzerland. 2022. 511–28. https://doi.org/10.1007/978-3-031-19772-7_30
- 22. Gu T, Chen G, Li J, Lin C, Rao Y, Zhou J, et al. Stochastic Trajectory Prediction via Motion Indeterminacy Diffusion. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 17092–101. https://doi.org/10.1109/cvpr52688.2022.01660
- 23. Maeda T, Ukita N. Fast Inference and Update of Probabilistic Density Estimation on Trajectory Prediction. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 9761–71. https://doi.org/10.1109/iccv51070.2023.00898
- 24. Shi L, Wang L, Zhou S, Hua G. Trajectory Unified Transformer for Pedestrian Trajectory Prediction. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 9641–50. https://doi.org/10.1109/iccv51070.2023.00887
- 25. Pellegrini S, Ess A, Schindler K, van Gool L. You’ll never walk alone: Modeling social behavior for multi-target tracking. In: 2009 IEEE 12th International Conference on Computer Vision, 2009. 261–8. https://doi.org/10.1109/iccv.2009.5459260
- 26. Lerner A, Chrysanthou Y, Lischinski D. Crowds by Example. Computer Graphics Forum. 2007;26(3):655–64.
- 27. Robicquet A, Sadeghian A, Alahi A, Savarese S. Learning social etiquette: Human trajectory understanding in crowded scenes. In: Computer Vision – ECCV 2016. vol. 9912. Amsterdam, The Netherlands: Springer. 2016. 549–65.
- 28. Mangalam K, Girase H, Agarwal S, Lee KH, Adeli E, Malik J. It is not the journey but the destination: Endpoint conditioned trajectory prediction. In: Computer Vision – ECCV 2020: 16th European Conference, Part II, Glasgow, UK, 2020. 759–76.
- 29. Shi L, Wang L, Long C, Zhou S, Zheng F, Zheng N, et al. Social Interpretable Tree for Pedestrian Trajectory Prediction. AAAI. 2022;36(2):2235–43.
- 30. Xu C, Mao W, Zhang W, Chen S. Remember Intentions: Retrospective-Memory-based Trajectory Prediction. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 6478–87. https://doi.org/10.1109/cvpr52688.2022.00638
- 31. Yang B, He C, Wang P, Chan C-Y, Liu X, Chen Y. TPPO: A Novel Trajectory Predictor With Pseudo Oracle. IEEE Trans Syst Man Cybern, Syst. 2024;54(5):2846–59.
- 32. Jiang Z, Ma Y, Shi B, Lu X, Xing J, Gonçalves N. Social NSTransformers: Low-quality pedestrian trajectory prediction. IEEE Trans Artif Intell. 2024;5:5575–88.
- 33. Jiang Z, Qin C, Yang R, Shi B, Alsaadi FE, Wang Z. Social Entropy Informer: A Multi-Scale Model-Data Dual-Driven Approach for Pedestrian Trajectory Prediction. IEEE Trans Intell Transport Syst. 2025;26(10):16438–53.
- 34. Jiang Z, Yang R, Ma Y, Qin C, Chen X, Wang Z. Social Informer: Pedestrian Trajectory Prediction by Informer With Adaptive Trajectory Probability Region Optimization. IEEE Trans Cybern. 2026;56(1):15–28. pmid:41052188
- 35. Wen Y, Li Z, Xu P. Dynamic graph transformer for pedestrian potential trajectory prediction under the world perspective. Neurocomputing. 2026;664:132125.