Local causal dynamic integrated global mode guidance transformer network for pedestrian trajectory prediction

doi:10.1371/journal.pone.0347049

Fig 1.

Social interactions induce multi-modal future uncertainty.

Given the same observed history, different interaction outcomes can lead to multiple plausible future trajectories (red dashed), while only one is realized as the ground truth (blue). This illustration motivates the need to predict multiple modes to capture the eventual outcome.

Fig 2.

Overview of the proposed LGCMT framework.

The target trajectory X_i and neighbor trajectories are embedded and processed by a local–global collaborative encoder, consisting of a Causal Temporal Encoder (CTE) with SCT-MSA for local dynamics and a Global Context Encoder (GCE) for global motion trends. A motion-pattern library is scored by the CLS head to select top-K modes, and a socially-aware non-autoregressive decoder (REG head) generates K future trajectory hypotheses in parallel.
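
The mode-selection step in this caption (score a motion-pattern library, keep the top-K entries) can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function name `select_top_k_modes`, the array shapes, and the assumption that a higher score means a more plausible mode are hypothetical choices made for illustration.

```python
import numpy as np

def select_top_k_modes(scores: np.ndarray, library: np.ndarray, k: int):
    """Return indices and entries of the k highest-scoring library modes.

    scores:  (N_lib,) one score per motion pattern (assumed: higher is better).
    library: (N_lib, T_pred, 2) candidate motion patterns (assumed layout).
    """
    # Sort ascending, reverse to descending, keep the first k indices.
    top_idx = np.argsort(scores)[::-1][:k]
    return top_idx, library[top_idx]

# Hypothetical example: 4 library modes, 3 future steps, 2D coordinates.
scores = np.array([0.1, 0.7, 0.2, 0.5])
library = np.zeros((4, 3, 2))
top_idx, top_modes = select_top_k_modes(scores, library, k=2)
```

The selected modes would then be passed to a decoder that refines each one into a full trajectory hypothesis. That refinement is not shown here.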

Fig 3.

Detailed structure of the Sparse Causal Temporal Multi-head Self-Attention (SCT-MSA) module.

The mechanism restricts attention to a local sliding window of size R_window (shaded grey area), ensuring that the feature representation at time t depends only on the recent history [t − R_window, t]. This design enforces causality and reduces computational complexity compared to full self-attention.
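
The windowing rule in this caption can be sketched as a boolean attention mask in which step t may attend only to steps in [t − R_window, t]. The code below is a minimal single-head illustration without learned projections, not the SCT-MSA module itself. The function names and the plain NumPy softmax are assumptions made for the example.

```python
import numpy as np

def local_causal_mask(seq_len: int, r_window: int) -> np.ndarray:
    """Boolean mask: entry (t, s) is True when step t may attend to step s."""
    t = np.arange(seq_len)[:, None]
    s = np.arange(seq_len)[None, :]
    # Causal (s <= t) and local (s >= t - r_window).
    return (s <= t) & (s >= t - r_window)

def windowed_attention(q, k, v, r_window):
    """Scaled dot-product attention restricted to the local causal window.

    q, k, v: (seq_len, d) arrays for a single head (assumed layout).
    Returns the attended values and the attention weights.
    """
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = local_causal_mask(seq_len, r_window)
    # Disallowed positions get -inf so their softmax weight is exactly 0.
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v, weights
```

Each row of the mask holds at most R_window + 1 allowed positions, so a sparse implementation would cost on the order of seq_len × R_window score evaluations instead of seq_len². The dense version above computes all scores and is meant only to show the masking rule.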

Table 1.

Performance comparison on the ETH and UCY datasets. Values represent minADE/minFDE in meters. The ↓ symbol indicates that lower values are better. The best results are shown in bold.
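
For readers unfamiliar with the metrics, minADE and minFDE over K predicted trajectories can be computed as in the sketch below. It assumes the common convention that the two minima are taken independently over the K hypotheses. The paper's exact evaluation protocol is not stated in this caption, and the function name and array shapes are illustrative.

```python
import numpy as np

def min_ade_fde(preds: np.ndarray, gt: np.ndarray):
    """Best-of-K displacement errors for one pedestrian.

    preds: (K, T, 2) predicted future positions (assumed layout).
    gt:    (T, 2) ground-truth future positions.
    """
    # Euclidean distance at every future step, for every hypothesis: (K, T).
    dists = np.linalg.norm(preds - gt[None], axis=-1)
    ade = dists.mean(axis=1)  # average displacement per hypothesis
    fde = dists[:, -1]        # final-step displacement per hypothesis
    # Assumption: each minimum is taken independently over the K hypotheses.
    return float(ade.min()), float(fde.min())

# Hypothetical example with K = 2 hypotheses and T = 2 future steps.
gt = np.array([[0.0, 0.0], [1.0, 0.0]])
preds = np.array([
    [[0.0, 3.0], [1.0, 3.0]],  # offset by 3 at both steps: ADE 3, FDE 3
    [[0.0, 0.0], [1.0, 4.0]],  # exact, then off by 4:      ADE 2, FDE 4
])
min_ade, min_fde = min_ade_fde(preds, gt)
```

In this example the hypothesis with the lowest ADE is not the one with the lowest FDE, which is why the convention matters when comparing reported numbers.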

Table 2.

Performance comparison on the Stanford Drone Dataset (SDD). Prediction errors are reported as ADE/FDE in pixels, computed for the best of 20 predicted trajectories. Lower values are better.

Table 3.

Ablation study results on the ETH and UCY datasets. All values are minADE/minFDE in meters. The performance of the full model is highlighted in bold.

Table 4.

Comparison of model complexity and inference speed. Parameters are reported in millions (M), FLOPs in GFLOPs (G), and inference time in milliseconds (ms). All models were evaluated on an NVIDIA RTX 4070 GPU.

Table 5.

Impact of hidden dimensions on model performance on SDD.

Table 6.

Robustness analysis. We report the Mean ± Standard Deviation over 5 independent runs. The Mean is reported to 2 decimal places, and the Standard Deviation to 3 decimal places to highlight the minimal variance.
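
The reporting convention in this caption (mean to 2 decimal places, standard deviation to 3) can be reproduced as follows. The helper name is hypothetical and the run values are made-up placeholders, not results from the paper. The use of the sample standard deviation (`ddof=1`) is also an assumption, since the caption does not say which estimator was used.

```python
import numpy as np

def format_mean_std(values) -> str:
    """Format runs as 'mean ± std' with 2 and 3 decimal places."""
    arr = np.asarray(values, dtype=float)
    mean = arr.mean()
    # Assumption: sample standard deviation (ddof=1) over the independent runs.
    std = arr.std(ddof=1)
    return f"{mean:.2f} ± {std:.3f}"

# Placeholder values for 5 hypothetical runs (not taken from the paper).
summary = format_mean_std([0.20, 0.21, 0.19, 0.20, 0.20])
```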

Table 7.

Comparison of autoregressive (AR) and non-autoregressive (NAR) versions of LGCMT. Predictive performance is measured in ADE/FDE (meters), and inference time is in milliseconds (ms). Results for the superior NAR model are in bold.

Fig 4.

Sensitivity to the motion-pattern library size N_lib.

Bars report the average minADE/minFDE when a single N_lib is used for all scenes. Horizontal dashed lines denote the baseline obtained with scene-specific optimal N_lib. Performance is stable over a range of N_lib values, while per-scene tuning achieves the best accuracy.

Fig 5.

Sensitivity to the local history window size R_window in SCT-MSA.

Bars report the average minADE/minFDE when a single R_window is used for all scenes. Horizontal dashed lines denote the baseline obtained with scene-specific optimal R_window. The model is robust to R_window, with stable performance around R_window = 4–5.

Fig 6.

Multi-modal trajectory predictions on ETH/UCY scenes.

(A) ETH, (B) HOTEL, (C) UNIV, (D) ZARA1, and (E) ZARA2. Observed trajectories are shown in green, ground-truth futures in blue, predicted trajectories as red dashed lines, and the best prediction in solid red.

Fig 7.

Qualitative comparison with a baseline method.

(A) ETH, (B) HOTEL, (C) UNIV, and (D) ZARA. Observations are shown in green and ground truth in blue. LGCMT is shown in red and the baseline is shown in orange.
