Fig 1.
Social interactions induce multi-modal future uncertainty.
Given the same observed history, different interaction outcomes can lead to multiple plausible future trajectories (red dashed), while only one is realized as the ground truth (blue). This illustration motivates the need to predict multiple modes to capture the eventual outcome.
Fig 2.
Overview of the proposed LGCMT framework.
The target trajectory Xi and neighbor trajectories are embedded and processed by a local–global collaborative encoder, consisting of a Causal Temporal Encoder (CTE) with SCT-MSA for local dynamics and a Global Context Encoder (GCE) for global motion trends. A motion-pattern library is scored by the CLS head to select top-K modes, and a socially-aware non-autoregressive decoder (REG head) generates K future trajectory hypotheses in parallel.
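The top-K mode selection described in this caption can be illustrated with a minimal sketch. It assumes the CLS head outputs one score per library entry and that the library is stored as an array of candidate motion patterns; the function name `select_top_k_modes`, the array shapes, and the random data are hypothetical and are not taken from the paper.

```python
import numpy as np

def select_top_k_modes(scores, library, k):
    """Return indices and patterns of the k highest-scoring library entries.

    scores:  (N_lib,) one score per motion pattern (assumed CLS head output)
    library: (N_lib, T, 2) candidate motion patterns (assumed layout)
    """
    # Sort ascending, reverse to descending, keep the first k indices.
    top_idx = np.argsort(scores)[::-1][:k]
    return top_idx, library[top_idx]

# Illustrative data only: 4 patterns, 12 future steps, 2-D positions.
scores = np.array([0.1, 0.7, 0.2, 0.5])
library = np.random.default_rng(0).normal(size=(4, 12, 2))
top_idx, top_modes = select_top_k_modes(scores, library, k=2)
```

In this sketch the selected patterns would then be passed to the decoder, which refines all K hypotheses in parallel.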
Fig 3.
Detailed structure of the Sparse Causal Temporal Multi-head Self-Attention (SCT-MSA) module.
The mechanism restricts attention to a local sliding window of size Rwindow (shaded grey area), ensuring that the feature representation at time t depends only on the recent history [t − Rwindow, t]. This design enforces causality and reduces computational complexity compared to full self-attention.
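The windowed causal restriction in this caption can be sketched as a boolean attention mask. The code below is a single-head illustration only, assuming that position t may attend to positions in the inclusive range [t − Rwindow, t]; the function names and the NumPy implementation are assumptions, not the authors' code.

```python
import numpy as np

def sliding_causal_mask(seq_len, r_window):
    """mask[t, s] is True when t - r_window <= s <= t."""
    t = np.arange(seq_len)[:, None]  # query positions
    s = np.arange(seq_len)[None, :]  # key positions
    return (s <= t) & (s >= t - r_window)

def masked_attention(q, k, v, r_window):
    """Single-head scaled dot-product attention under the windowed causal mask."""
    mask = sliding_causal_mask(q.shape[0], r_window)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Disallowed positions get -inf, so they receive zero weight after softmax.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Each query attends to at most Rwindow + 1 keys, so a sparse implementation would scale with the window size rather than with the full sequence length. This dense version only illustrates the masking pattern.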
Table 1.
Performance comparison on the ETH and UCY datasets. Values represent minADE/minFDE in meters. The ↓ symbol indicates that lower values are better. The best results are shown in bold.
Table 2.
Performance comparison on the Stanford Drone Dataset (SDD). Prediction errors are reported as ADE/FDE in pixels. Values are averaged over the best of 20 predicted trajectories. Lower values are better.
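A minimal sketch of the best-of-K evaluation for one agent follows. It assumes K predicted trajectories of shape (K, T, 2) and takes the minimum ADE and the minimum FDE independently over the K samples, which is a common convention; the paper's exact protocol may differ, and the function name is hypothetical.

```python
import numpy as np

def min_ade_fde(preds, gt):
    """Best-of-K displacement errors for one agent.

    preds: (K, T, 2) predicted trajectories
    gt:    (T, 2) ground-truth future trajectory
    """
    # Euclidean distance at every time step for every sample: shape (K, T).
    dists = np.linalg.norm(preds - gt[None], axis=-1)
    ade = dists.mean(axis=1)  # average displacement error per sample
    fde = dists[:, -1]        # final displacement error per sample
    return float(ade.min()), float(fde.min())
```

Dataset-level numbers would then be obtained by averaging these per-agent values over all test agents, with K = 20 in the setting reported here.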
Table 3.
Ablation study results on the ETH and UCY datasets. All values are minADE/minFDE in meters. The performance of the full model is highlighted in bold.
Table 4.
Comparison of model complexity and inference speed. Params are reported in millions (M), FLOPs in gigaFLOPs (G), and inference time in milliseconds (ms). All models were evaluated on an NVIDIA RTX 4070 GPU.
Table 5.
Impact of the hidden dimension on model performance on SDD.
Table 6.
Robustness analysis. We report the Mean ± Standard Deviation over 5 independent runs. The Mean is reported to 2 decimal places, and the Standard Deviation to 3 decimal places to highlight the minimal variance.
Table 7.
Comparison of autoregressive (AR) and non-autoregressive (NAR) versions of LGCMT. Predictive performance is measured in ADE/FDE (meters), and inference time is in milliseconds (ms). Results for the superior NAR model are in bold.
Fig 4.
Sensitivity to the motion-pattern library size Nlib.
Bars report the average minADE/minFDE when a single Nlib is used for all scenes. Horizontal dashed lines denote the baseline obtained with the scene-specific optimal Nlib. Performance is stable across a range of Nlib values, while per-scene tuning achieves the best accuracy.
Fig 5.
Sensitivity to the local history window size Rwindow in SCT-MSA.
Bars report the average minADE/minFDE when a single Rwindow is used for all scenes. Horizontal dashed lines denote the baseline obtained with the scene-specific optimal Rwindow. The model is robust to Rwindow, with stable performance around Rwindow = 4–5.
Fig 6.
Multi-modal trajectory predictions on ETH/UCY scenes.
(A) ETH, (B) HOTEL, (C) UNIV, (D) ZARA1, and (E) ZARA2. Observed trajectories are shown in green, ground-truth futures in blue, predicted trajectories as red dashed lines, and the best prediction in solid red.
Fig 7.
Qualitative comparison with a baseline method.
(A) ETH, (B) HOTEL, (C) UNIV, and (D) ZARA. Observed trajectories are shown in green and ground truth in blue. LGCMT predictions are shown in red and baseline predictions in orange.