Abstract
Recent advances in deep learning have substantially improved short-term metro passenger flow prediction. However, existing approaches often inadequately model the dependency of outflow on inflow and typically rely on predefined station correlation graphs, which limits modeling flexibility and representational capacity. To address these issues, this study decomposes the influence of inflow on outflow into short-term and long-term temporal components and proposes a dual-temporal inflow–outflow dependency model (DTIOD). DTIOD adopts an asymmetric feature extraction scheme to encode inflow and outflow sequences according to their distinct roles in forecasting. Instead of using predefined station correlation graphs or explicit spatial modules, the model employs a dual-branch cross-attention mechanism to capture inflow–outflow dependencies across multiple temporal scales, thereby enabling implicit learning of spatial correlations. In addition, sample-level origin–destination (OD) matrices are incorporated as additive attention biases to embed prior inter-station relationships and guide attention allocation. The outflow features are adaptively fused with the long-term and short-term inflow effect representations through learnable weights, and final predictions are generated by a fully connected layer. Experiments on the Hangzhou metro dataset show that DTIOD reduces RMSE (root mean squared error), MAE (mean absolute error), and WMAPE (weighted mean absolute percentage error) by 10.75%, 11.60%, and 6.84%, respectively, compared with the strongest baseline, while completing training within 70 seconds. These results demonstrate that DTIOD achieves a favorable balance between predictive accuracy and computational efficiency, indicating its practical applicability.
Citation: Hu W, Huang Z, Cai J, Zhao X (2026) Dual-temporal inflow–outflow dependency modeling for short-term metro outflow prediction. PLoS One 21(4): e0347131. https://doi.org/10.1371/journal.pone.0347131
Editor: Guangyin Jin, National University of Defense Technology, CHINA
Received: February 15, 2026; Accepted: March 29, 2026; Published: April 21, 2026
Copyright: © 2026 Hu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The minimal dataset underlying the findings of this study is publicly available on Figshare: https://doi.org/10.6084/m9.figshare.31810948.
Funding: This research was supported by the National Natural Science Foundation of China under Grant Nos. 51978082 (Zhongxiang Huang) and 52302389 (Jianrong Cai).
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Accurate short-term passenger flow forecasting is essential for improving metro operational efficiency, supporting refined passenger flow guidance, and responding promptly to emergencies. However, metro passenger flows are influenced by diverse factors such as land use patterns, meteorological conditions, public activities, and holiday effects, exhibiting significant nonlinearity and instability. As a result, achieving reliable short-term metro flow prediction remains a major challenge in intelligent transportation research.
Metro passenger flow forecasting has progressed from traditional statistical models to machine learning techniques, and more recently to deep learning-based frameworks. Statistical approaches such as Kalman filters [1] and autoregressive integrated moving average [2] capture basic temporal dependencies but exhibit limited capacity in modeling nonlinear patterns and complex disturbances. Machine learning methods, including decision trees [3], Bayesian models [4], and support vector machines [5], improve nonlinear representation but rely heavily on manual feature engineering. As data availability and computational capacity continue to increase, deep learning methods have shown clear advantages in complex signal modeling, wireless sensing, and high-dimensional pattern recognition because of their strong nonlinear representation capability [6–8]. Consequently, deep learning has become a dominant paradigm in metro passenger flow forecasting research [9]. Among these approaches, spatiotemporal fusion models [10] have been widely adopted as a central framework in both theoretical studies and practical applications. Building on this foundation, recent research has mainly explored three categories of mechanisms to further improve predictive performance.
The first category of mechanisms targets the restructuring and enrichment of spatiotemporal inputs to forecasting models. Temporal preprocessing strategies such as series decomposition [11] and reorganization [12,13] are commonly adopted to regulate nonstationary, nonlinear, and multi-scale dynamics in passenger flow sequences, offering temporally more tractable representations. In the spatial domain, interstation dependencies are encoded through the construction and refinement of station correlation graphs [14], including passenger flow similarity graphs [15], origin–destination (OD) flow graphs [16–18], point of interest (POI) similarity graphs [16,19], dynamic graphs [20–22], and other multi-relational graphs [23,24]. In addition, auxiliary variables such as calendar attributes, meteorological conditions, and operational schedules are often incorporated to model exogenous influences on passenger flow evolution, further improving predictive accuracy and generalization [25].
The second category of mechanisms focuses on improving spatiotemporal feature representation and interaction modeling through the refinement of temporal learning modules, spatial learning modules, and their integration strategies. In temporal modeling, convolutional operators and attention mechanisms are widely adopted to capture long-term trends and short-term dynamics in passenger flow [26–28]. For spatial feature learning, graph convolutional networks (GCNs) remain the dominant modeling paradigm. Variants such as the personalized enhanced GCN improve model responsiveness under peak-flow conditions and alleviate the noise sensitivity of conventional GCNs [29]. Moreover, recent studies increasingly emphasize coordinated spatiotemporal learning with multi-scale coupling, enabling temporal dependencies and spatial structures to be learned jointly and resulting in more robust predictive performance [30,31].
The third category of methods is grounded in fundamental metro travel behavior and explicitly models spatiotemporal dependencies between inflow and outflow. In metro systems, outflow at a station originates from earlier inflow at other stations, establishing a clear temporal order in which entry precedes exit and resulting in cross-station spatial coupling. Consequently, treating these flows as independent multi-variate inputs fails to capture their intrinsic dependency structures [11,24,31,32]. To address this limitation, several studies incorporate passenger flow formation mechanisms that characterize the propagation pathways and dynamic evolution of passenger movements from inflow to outflow, thereby enhancing interpretability while improving forecasting accuracy [16,17,33,34].
Overall, existing studies have substantially advanced metro passenger flow forecasting through input enhancement, structural optimization, and mechanism-driven modeling. Nevertheless, several critical limitations remain, which can be summarized as follows.
- (1) Excessive reliance on predefined station correlation graphs. Most approaches depend on such graphs to model spatial relationships in passenger flows. However, simple adjacency graphs struggle to capture the multi-dimensional latent connections among stations, whereas overly dense graph structures often introduce redundant information that degrades performance. Although multi-graph and dynamic graph formulations partially alleviate these issues, their performance improvements are typically achieved at the cost of substantially increased model complexity. Moreover, graph construction is highly sensitive to data incompleteness and noise, which undermines the robustness of spatial relationship learning. To date, relatively few studies have investigated latent interstation passenger flow interactions without relying on predefined correlation graphs [35].
- (2) Insufficient characterization of inflow–outflow relationships. The relationship between inflow and outflow in metro systems exhibits both short-term dynamics and long-term evolutionary patterns. Short-term dependencies arise from the passenger travel process. During a typical journey, passengers pass through entry, travel, and exit stages; therefore, outflow at a given time step is partly influenced by inflow in preceding time steps. In addition to these short-term effects, inflow and outflow also display pronounced long-term patterns. Commuting behavior illustrates this phenomenon. Passengers often exit near their workplaces during the morning peak and later enter the same stations during the evening peak to return home. In contrast, stations located in residential areas exhibit the opposite inflow and outflow pattern. Such regular travel behavior creates a relatively stable correspondence between inflow and outflow at the same station over longer time horizons and produces persistent long-term dependencies. However, most existing studies rely on OD data to characterize passenger flow relationships [16,17,33,34]. These data are derived from complete travel records and describe passenger movements between origins and destinations within short time intervals. Consequently, they are inherently limited to modeling short-term transformation patterns between stations.
- (3) Existing strategies for exploiting OD data struggle to balance dynamic representational capacity with computational efficiency. One line of research constructs static OD correlation graphs from OD statistics aggregated at the dataset level [16–18]. While these graphs capture overall movement patterns, they fail to reflect the dynamic evolution of passenger flows. Another line constructs OD graphs at each time step to model instantaneous spatial distributions [13,22,33,34], at the expense of substantially increased computational cost. Consequently, achieving a principled trade-off between temporal responsiveness and computational tractability remains an open and practically significant challenge.
To address these limitations, this study examines the evolutionary patterns of inflow and outflow and decomposes the influence of inflow into short-term and long-term temporal components. On this basis, a dual-temporal inflow–outflow dependency model (DTIOD) is developed for metro outflow prediction. Since the dependency between inflow and outflow varies across temporal scales, the model aims to capture the multi-scale influence of historical inflow on future outflow. Attention mechanisms have recently been widely adopted for multi-scale temporal modeling and cross-modal feature integration because they enable dynamic weighting to capture complex dependencies [36–39]. Building on this capability, DTIOD employs a two-branch multi-head attention architecture within a unified embedding space to model both long-term and short-term dependencies from historical inflow to future outflow. A sample-level OD flow matrix is further incorporated as a learnable bias in the short-term branch, improving sensitivity to dynamic flow patterns while maintaining a lightweight model structure.
The main contributions of this paper are summarized as follows.
- (1) A cross-sequence dual-temporal modeling framework with a dual-branch attention mechanism is proposed. Unlike existing approaches that primarily rely on OD data to characterize short-term passenger interactions among stations, the proposed framework explicitly models the evolutionary relationships between inflow and outflow across long-term and short-term temporal scales. This design enables the outflow representation of each station to selectively aggregate multi-scale inflow information from the entire network.
- (2) In contrast to conventional spatiotemporal fusion approaches, DTIOD directly learns multi-scale dependencies between inflow and outflow without relying on predefined station correlation graphs or explicit spatial modules. A sample-level OD bias is incorporated into the attention computation to encode prior passenger flow relationships among stations. This design enables the model to capture short-term flow variations while maintaining computational efficiency.
- (3) Experiments on real-world datasets demonstrate that DTIOD outperforms mainstream approaches in both predictive accuracy and training efficiency. Because the model relies solely on inflow and outflow data, it requires substantially lower data completeness than typical spatiotemporal fusion models, which improves its practical applicability for metro passenger flow forecasting.
The remainder of the paper is organized as follows. Section 2 describes the DTIOD architecture and its core modules. Section 3 presents the datasets and model configurations. Section 4 reports the experimental results and provides comparative analyses. Section 5 concludes the paper and discusses future research directions.
2 Problem description and model construction
2.1 Problem description
This study is motivated by travel behavior patterns that drive passenger flow dynamics across both short-term and long-term temporal scales. Accordingly, we decompose the influence of inflow on outflow into short-term and long-term components and explicitly model the resulting dual-scale temporal dependencies to enable accurate prediction of metro outflow in a future time interval. To formalize the problem setting, five key concepts are defined below.
Definition 1 (inflow and outflow time series).
Let $X \in \mathbb{R}^{N \times T}$ and $Y \in \mathbb{R}^{N \times T}$ denote the inflow and outflow time series extracted from Automatic Fare Collection (AFC) data. The outflow time series $Y$ is defined as

$$Y = [y_1, y_2, \ldots, y_T] \tag{1}$$

where $N$ denotes the number of stations, and $T$ is the total number of time steps after aggregating the data at a specified time granularity. The outflow vector at time step $t$, denoted as $y_t \in \mathbb{R}^{N}$, is defined as:

$$y_t = [y_t^{1}, y_t^{2}, \ldots, y_t^{N}]^{\top} \tag{2}$$

where $y_t^{i}$ represents the outflow volume at station $i$ during time step $t$.

The inflow time series $X$, the inflow vector $x_t$, and the element $x_t^{i}$ are defined similarly.
Definition 2 (time-step-based OD tensor).
Based on AFC records with precise entry and exit timestamps, a time-step OD matrix $M_t \in \mathbb{R}^{N \times N}$ is constructed for each time step $t$. It is formed by identifying all passenger trips that terminate within time step $t$ and retrieving their corresponding entry stations from AFC trajectories. The resulting matrix characterizes the distribution of passengers completing their trips during $t$, irrespective of their entry times. Compared with OD matrices constructed from the entire dataset or training set, the time-step-based OD matrix preserves finer-grained spatiotemporal information. Since it is derived solely from trips ending at the current time step, it avoids the future information leakage that can occur when generating prediction samples from entry-based OD matrices. Stacking these matrices for each time step forms a time-step-based OD tensor, formally defined as:

$$\mathcal{M} = [M_1, M_2, \ldots, M_T] \in \mathbb{R}^{T \times N \times N} \tag{3}$$
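As a concrete illustration, the time-step OD construction described above can be sketched in a few lines of Python, assuming cleaned AFC trips are available as (origin, destination, entry time, exit time) tuples; this record layout is a hypothetical simplification, not the paper's implementation:

```python
def build_od_tensor(trips, n_stations, n_steps, step_minutes=15):
    """Build a time-step-based OD tensor from cleaned AFC trip records.

    Each trip is (origin, destination, entry_min, exit_min), times in
    minutes from the start of the study period. od[t][o][d] counts trips
    that *terminate* in time step t, regardless of their entry time.
    """
    od = [[[0] * n_stations for _ in range(n_stations)]
          for _ in range(n_steps)]
    for origin, dest, entry_min, exit_min in trips:
        t = int(exit_min // step_minutes)  # step in which the trip ends
        if 0 <= t < n_steps:
            od[t][origin][dest] += 1
    return od

# three toy trips over 3 stations and two 15-minute steps
trips = [(0, 1, 3, 14), (0, 1, 5, 20), (2, 1, 0, 29)]
od = build_od_tensor(trips, n_stations=3, n_steps=2)
```

Indexing by exit time is what keeps the matrix free of future information when samples are generated later.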
Definition 3 (sample).
A complete sample for the DTIOD model comprises an outflow sequence, a long-term inflow sequence, a short-term inflow sequence, and a corresponding OD matrix. These components are extracted from the original outflow and inflow time series using sliding windows with lengths $T_y$, $T_s$, and $T_l$, respectively.

The outflow sequence sample ending at time step $t$ is defined as:

$$Y_t = [y_{t-T_y+1}, \ldots, y_t] \in \mathbb{R}^{N \times T_y} \tag{4}$$

The short-term inflow sequence sample is defined as:

$$X_t^{s} = [x_{t-T_s+1}, \ldots, x_t] \in \mathbb{R}^{N \times T_s} \tag{5}$$

The long-term inflow sequence sample is defined as:

$$X_t^{l} = [x_{t-T_l+1}, \ldots, x_t] \in \mathbb{R}^{N \times T_l} \tag{6}$$

where $t$ denotes the shared final time step of the three windows. During sample construction, the alignment constraint that the outflow, short-term inflow, and long-term inflow windows all end at the same time step $t$ is enforced to incorporate the most recent inflow information while avoiding future inflow leakage.
OD data are derived from complete individual travel records and inherently reflect short-term passenger movements between origins and destinations. Accordingly, each OD matrix sample is designed to encode spatiotemporal station-level associations within a short time window, enabling effective exploitation of OD information. Specifically, for each short-term inflow sequence sample $X_t^{s}$, all OD matrices within the interval $[t-T_s+1, t]$ are temporally aggregated, and their sum is used as the corresponding OD matrix sample $M_t^{s}$, as defined in Eq. (7):

$$M_t^{s} = \sum_{\tau=t-T_s+1}^{t} M_\tau \tag{7}$$

This aggregation provides a unified representation of OD flows over the short-term window, capturing the overall travel intensity between stations during the corresponding period.
Notably, the OD matrix sample is not constructed as a temporal sequence. Instead, it is integrated as an input feature alongside the short-term inflow sequence. This design preserves the salient temporal information in the OD data while avoiding the computational overhead of explicit sequential OD representations.
For sample $t$, the model produces an outflow prediction. The corresponding ground-truth outflow is given by Eq. (8):

$$y_{t+1} = [y_{t+1}^{1}, y_{t+1}^{2}, \ldots, y_{t+1}^{N}]^{\top} \tag{8}$$
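The sample-slicing logic of Definition 3 can be sketched as a small helper; the function below uses plain lists and toy window lengths for illustration, but follows the same right-aligned windows and the Eq. (7) aggregation:

```python
def make_sample(Y, X, OD, t, T_y, T_s, T_l):
    """Slice one DTIOD sample whose three windows all end at time step t,
    so no future information leaks into the inputs.

    Y, X : lists of per-step outflow / inflow vectors (length-N lists).
    OD   : list of per-step N x N OD matrices.
    Returns the outflow window, short/long inflow windows, the OD matrix
    sample summed over the short-term window (Eq. 7), and target y_{t+1}.
    """
    assert t - max(T_y, T_s, T_l) + 1 >= 0 and t + 1 < len(Y)
    n = len(OD[0])
    od_sum = [[sum(OD[s][i][j] for s in range(t - T_s + 1, t + 1))
               for j in range(n)] for i in range(n)]
    return (Y[t - T_y + 1: t + 1],
            X[t - T_s + 1: t + 1],
            X[t - T_l + 1: t + 1],
            od_sum,
            Y[t + 1])

# toy series: 2 stations, 6 time steps
Y = [[t, 10 * t] for t in range(6)]
X = [[t + 1, t + 2] for t in range(6)]
OD = [[[t, 0], [0, t]] for t in range(6)]
y_win, xs_win, xl_win, od_s, target = make_sample(Y, X, OD, t=4,
                                                  T_y=3, T_s=2, T_l=4)
```

In the paper's configuration the windows would be $T_y = 72$, $T_s = 4$, and $T_l = 72$; the toy values here only keep the example short.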
Definition 4 (objective equation).
The outflow prediction equation is defined as follows:

$$\hat{y}_{t+1} = F_{\Theta}\left(Y_t,\, X_t^{s},\, X_t^{l},\, M_t^{s}\right)$$

where $F$ denotes the proposed DTIOD model and $\Theta$ its learnable parameters, and $\hat{y}_{t+1}$ represents the one-step-ahead prediction for sample $t$.
2.2 Model construction
The DTIOD model comprises a feature extraction module, a dual-temporal cross-attention module, and a fusion and prediction module. The feature extraction module contains three parallel branches that encode the outflow, short-term inflow, and long-term inflow sequences into high-dimensional feature representations. The cross-attention module consists of two multi-head attention mechanisms that model the effects of short-term and long-term inflows on the target outflow, thereby capturing multi-scale dependencies without relying on station correlation graphs or explicit spatial modules. The representations from the outflow branch and the cross-attention module are then adaptively fused and passed to a prediction head to generate the final outflow forecasts. The overall architecture is illustrated in Fig 1.
In the model implementation, let $B$ denote the batch size. The model takes four input tensors: the outflow tensor $\mathbf{Y} \in \mathbb{R}^{B \times N \times T_y}$, the short-term inflow tensor $\mathbf{X}^{s} \in \mathbb{R}^{B \times N \times T_s}$, the long-term inflow tensor $\mathbf{X}^{l} \in \mathbb{R}^{B \times N \times T_l}$, and the OD tensor $\mathbf{M} \in \mathbb{R}^{B \times N \times N}$.
2.2.1 Feature extraction module.
The feature extraction module learns temporal representations of the target outflow and encodes inflow sequences for subsequent dependency modeling. As accurate outflow prediction is the primary objective, the outflow encoder is designed to effectively capture complex temporal dynamics. In contrast, the inflow encoder plays an auxiliary role by supplying Key–Value representations for the cross-attention mechanism. Accordingly, it focuses on producing embeddings that are feature-dimension aligned with the outflow representations, rather than independently modeling detailed inflow dynamics.
Using identical architectures for inflow and outflow feature extraction typically forces a trade-off between representational capacity and computational efficiency. Overly simple designs may fail to capture the complex temporal dynamics of outflow, whereas more expressive ones substantially increase model complexity and computational cost. To address this issue, we adopt an asymmetric feature extraction scheme that tailors modeling capacity to the distinct functional roles of inflow and outflow, thereby preserving predictive performance while maintaining a lightweight architecture.
Specifically, TimesNet [40] is employed to encode outflow temporal features. Based on the TimesBlock architecture, it transforms one-dimensional time series into multi-periodic two-dimensional representations and integrates Inception-style multi-scale convolutional structures to capture temporal variations across different periodic scales. In contrast, inflow sequences are encoded by a lightweight two-layer multilayer perceptron (MLP) [41], which provides sufficiently expressive Key–Value representations for the subsequent cross-attention module while incurring minimal computational overhead.
- (1) Outflow Branch
The input outflow tensor is first permuted to $\mathbb{R}^{B \times T_y \times N}$ and normalized along the temporal dimension using Z-score standardization. The normalized sequence is then projected into a $d$-dimensional embedding space, yielding the embedded representation $E_y \in \mathbb{R}^{B \times T_y \times d}$. A linear projection layer subsequently expands the temporal dimension from $T_y$ to $T_y + H$, where $H = 1$ is the prediction horizon in this study. The resulting tensor is fed into a stack of TimesBlocks to capture complex temporal variations.
Within each TimesBlock, the input is transformed along the temporal dimension using a Fast Fourier Transform (FFT) to obtain its amplitude spectrum. The spectrum is averaged over the batch and embedding dimensions, and the frequency components with the highest energy are selected and mapped to their corresponding dominant time periods. For each identified period, the temporal axis is reorganized according to the period length, producing a two-dimensional representation. The resulting periodic representations are processed by an Inception-style multi-scale convolution module, which extracts fine-grained patterns within individual periods as well as broader variations across periods. The convolution outputs associated with different periods are weighted and fused using softmax-normalized spectral energies as period-specific importance coefficients. The fused representation is then passed through a linear projection and integrated into the residual connection, a design that strengthens feature preservation and promotes stable gradient propagation. Before being forwarded to the next TimesBlock, Layer Normalization is applied to ensure consistent feature distributions across layers.
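The period-detection step inside each TimesBlock can be illustrated with a naive DFT on a single 1-D series; the actual TimesNet implementation uses a batched FFT over the embedding dimension, so this standalone sketch only mirrors the frequency-to-period logic:

```python
import cmath

def dominant_periods(x, k=2):
    """Pick the k dominant periods of a 1-D series from its amplitude
    spectrum, as in TimesNet's period detection. A naive O(T^2) DFT is
    used for clarity; the model itself uses a fast FFT.
    """
    T = len(x)
    amps = []
    for f in range(1, T // 2 + 1):  # skip the DC component
        coef = sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / T)
                   for t in range(T))
        amps.append((abs(coef), f))
    amps.sort(reverse=True)
    return [T // f for _, f in amps[:k]]  # map frequency -> period length

# a triangle wave with an exact period of 8 over 24 steps
series = [0, 1, 2, 1, 0, -1, -2, -1] * 3
periods = dominant_periods(series, k=1)
```

Each selected period is then used to fold the temporal axis into a 2-D layout of shape (period, number of cycles) before the Inception-style convolutions.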
The outflow temporal modeling stage outputs a feature sequence $H_y \in \mathbb{R}^{B \times (T_y + H) \times d}$. From this sequence, the feature vector at the target prediction step, $h_{t+1} \in \mathbb{R}^{B \times d}$, is extracted to construct the Query. This vector is linearly projected to an intermediate representation in $\mathbb{R}^{B \times (N \cdot d)}$ and reshaped into $Q \in \mathbb{R}^{B \times N \times d}$, resulting in one Query vector for each station.
- (2) Inflow Branch
The inflow branch comprises short-term and long-term components that share an identical architecture but operate at different temporal scales. Given an input tensor $\mathbf{X} \in \mathbb{R}^{B \times N \times T_x}$, where $T_x \in \{T_s, T_l\}$, each sample is first normalized along the temporal dimension using Z-score standardization. The normalized tensor is reshaped into $\mathbb{R}^{(B \cdot N) \times T_x}$, treating the inflow series of each station as an independent temporal sequence. These sequences are projected into a $d$-dimensional feature space through a two-layer MLP, producing embeddings that are channel-aligned with the corresponding outflow features. The mapping is defined as follows:

$$E_x = \mathrm{MLP}(\tilde{\mathbf{X}}) \in \mathbb{R}^{(B \cdot N) \times d}$$

The mapped features are reshaped to obtain $E_s \in \mathbb{R}^{B \times N \times d}$ and $E_l \in \mathbb{R}^{B \times N \times d}$. These tensors are then transformed by separate linear layers followed by layer normalization to produce the corresponding Key and Value representations, as defined in Eqs. (12)–(13):

$$K = \mathrm{LayerNorm}(E\,W^{K} + b^{K}) \tag{12}$$

$$V = \mathrm{LayerNorm}(E\,W^{V} + b^{V}) \tag{13}$$
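A minimal per-station sketch of this two-layer mapping, using the ReLU-then-GELU activations reported in Section 3.2 (the weights shown are illustrative placeholders, not learned parameters):

```python
import math

def gelu(u):
    # exact GELU via the Gaussian CDF
    return 0.5 * u * (1.0 + math.erf(u / math.sqrt(2.0)))

def encode_inflow(seq, W1, b1, W2, b2):
    """Map one station's inflow window (length T_x) to a d-dim embedding:
    linear -> ReLU -> linear -> GELU, per the inflow-branch design."""
    h = [max(0.0, sum(w * x for w, x in zip(row, seq)) + b)
         for row, b in zip(W1, b1)]
    return [gelu(sum(w * v for w, v in zip(row, h)) + b)
            for row, b in zip(W2, b2)]

# toy dimensions: T_x = 2 inflow steps -> hidden 2 -> embedding d = 2
emb = encode_inflow([1.0, 2.0],
                    W1=[[1.0, 0.0], [0.0, 1.0]], b1=[0.0, 0.0],
                    W2=[[1.0, 1.0], [0.5, -0.5]], b2=[0.0, 0.0])
```

In the model this mapping is applied to all $B \cdot N$ station sequences at once, which is what keeps the inflow encoder lightweight relative to the TimesNet outflow branch.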
2.2.2 Dual-temporal cross-attention module.
The cross-attention module comprises two independent multi-head attention mechanisms, denoted as Attention-L and Attention-S, as shown in Fig 2. Each mechanism uses the outflow features as the Query. Attention-L uses the long-term inflow features as the Key and Value, enabling the outflow Query to aggregate long-term contextual information and encode semantic dependencies relevant to the target outflow. In contrast, Attention-S uses the short-term inflow features as the Key and Value and incorporates the OD tensor, modulated by a learnable scaling factor, as an additive bias to the attention logits. This design integrates short-term flow dynamics and prior spatiotemporal relationships directly into the attention mechanism.
- (1) Attention-L
Let $H_L$ denote the number of attention heads in Attention-L, such that the dimension of each head is $d_h = d / H_L$. For head $h$, the corresponding projection matrices are denoted as $W_h^{Q}, W_h^{K}, W_h^{V} \in \mathbb{R}^{d \times d_h}$. The mapping for head $h$ is computed as follows:

$$Q_h = Q\,W_h^{Q}, \qquad K_h = K^{l}\,W_h^{K}, \qquad V_h = V^{l}\,W_h^{V}$$

The output of the attention head is given by:

$$\mathrm{head}_h = \mathrm{softmax}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{d_h}}\right) V_h$$

where $Q_h K_h^{\top}$ represents the dot-product attention logits for head $h$, $\sqrt{d_h}$ is the scaling factor, and $\mathrm{softmax}(\cdot)$ is the activation function.

The outputs of all attention heads are concatenated and projected to produce the final output of Attention-L:

$$Z^{l} = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_{H_L})\,W^{O}$$

where $W^{O} \in \mathbb{R}^{d \times d}$.
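A single-head, batch-free sketch of this cross-attention over stations may help fix ideas; pure Python is used for clarity, and the optional `bias` argument anticipates the additive OD bias used by Attention-S below:

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def cross_attention(Q, K, V, bias=None):
    """Scaled dot-product cross-attention over stations: Q comes from the
    outflow branch (one query per station), K/V from an inflow branch.
    `bias`, if given, is an N x N matrix added to the attention logits."""
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in K]
        if bias is not None:
            logits = [l + bias[i][j] for j, l in enumerate(logits)]
        w = softmax(logits)
        out.append([sum(w[j] * V[j][c] for j in range(len(V)))
                    for c in range(len(V[0]))])
    return out

# two stations, d = 2: each query matches its own key without bias...
Q = [[10.0, 0.0], [0.0, 10.0]]
out = cross_attention(Q, Q, [[1.0, 0.0], [0.0, 1.0]])
# ...while a strong cross bias redirects attention to the other station
biased = cross_attention(Q, Q, [[1.0, 0.0], [0.0, 1.0]],
                         bias=[[0.0, 200.0], [200.0, 0.0]])
```

Because the attention weights range over all stations, this is also where inter-station spatial correlations are learned implicitly, without a predefined graph.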
- (2) Attention-S
Let $H_S$ denote the number of attention heads in Attention-S, such that the dimension of each head is $d_h = d / H_S$. For head $h$, the corresponding projection matrices are denoted as $W_h^{Q}, W_h^{K}, W_h^{V} \in \mathbb{R}^{d \times d_h}$. The mapping for head $h$ is computed as follows:

$$Q_h = Q\,W_h^{Q}, \qquad K_h = K^{s}\,W_h^{K}, \qquad V_h = V^{s}\,W_h^{V}$$

In contrast to Attention-L, Attention-S introduces the OD tensor as an additive bias in the attention logits, scaled by a learnable scalar $\lambda$. To stabilize the attention computation, the OD tensor is normalized using Eqs. (18)–(19).

Each OD matrix $M_k$ within the tensor $\mathbf{M}$ undergoes an element-wise log1p transformation:

$$\tilde{M}_k = \log(1 + M_k) \tag{18}$$

where $M_k[i, j]$ represents the passenger flow from station $i$ to station $j$ in sample $k$.

For each sample $k$, the resulting matrix $\tilde{M}_k$ is then normalized using min-max scaling based on its sample-specific maximum and minimum values, ensuring that all entries fall within the interval [0, 1]:

$$\hat{M}_k = \frac{\tilde{M}_k - \min(\tilde{M}_k)}{\max(\tilde{M}_k) - \min(\tilde{M}_k)} \tag{19}$$

where $\min(\tilde{M}_k)$ and $\max(\tilde{M}_k)$ denote the minimum and maximum entries of $\tilde{M}_k$. The normalized matrices are then stacked to form the processed OD tensor $\hat{\mathbf{M}} \in \mathbb{R}^{B \times N \times N}$.
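The two-step normalization of Eqs. (18)–(19) can be sketched directly:

```python
import math

def normalize_od(M):
    """Normalize one OD matrix sample: element-wise log1p (Eq. 18)
    followed by sample-specific min-max scaling to [0, 1] (Eq. 19)."""
    logged = [[math.log1p(v) for v in row] for row in M]
    lo = min(min(row) for row in logged)
    hi = max(max(row) for row in logged)
    span = (hi - lo) or 1.0  # guard against a constant matrix
    return [[(v - lo) / span for v in row] for row in logged]

odn = normalize_od([[0, 9], [99, 3]])
```

The log1p step compresses the heavy-tailed OD counts before scaling, so a few very busy station pairs do not dominate the bias; the constant-matrix guard is an added safety check for degenerate samples.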
The attention logits for head $h$ are expressed as:

$$A_h = \frac{Q_h K_h^{\top}}{\sqrt{d_h}} + \exp(\lambda)\,\hat{\mathbf{M}}$$

Here, $\lambda$ is a learnable scalar that adaptively modulates the contribution of the OD bias to the attention logits. It balances the bias against the Query-Key similarity, ensuring that OD information enhances the attention distribution without dominating it. The $\exp(\cdot)$ function guarantees the scaling factor remains positive, thereby preserving the directionality of the bias.

The output of the attention head is given by:

$$\mathrm{head}_h = \mathrm{softmax}(A_h)\,V_h$$

The outputs of all attention heads are concatenated and projected to produce the final output of Attention-S:

$$Z^{s} = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_{H_S})\,W^{O}$$

where $W^{O} \in \mathbb{R}^{d \times d}$.
2.2.3 Fusion and prediction module.
The feature fusion module adaptively integrates three feature streams, namely $Q$, $Z^{s}$, and $Z^{l}$, using learnable weighting coefficients. Specifically, $Q$ is the station-level query vector extracted from the outflow temporal feature sequence, representing the contextualized outflow state at the target prediction time step. $Z^{s}$ and $Z^{l}$ correspond to the enhanced representations obtained by attending to short-term and long-term inflow features, respectively, thereby capturing their multi-scale influences on the target outflow.

Given the heterogeneous numerical distributions and scales of these feature streams, a shared layer normalization is first applied, producing the normalized representations $\tilde{Q}$, $\tilde{Z}^{s}$, and $\tilde{Z}^{l}$. A learnable parameter vector $\mathbf{w} \in \mathbb{R}^{3}$ is then passed through a Softmax function to generate adaptive fusion weights:

$$[\alpha_1, \alpha_2, \alpha_3] = \mathrm{softmax}(\mathbf{w})$$

The final fused representation is computed as the weighted sum of the normalized features:

$$Z = \alpha_1 \tilde{Q} + \alpha_2 \tilde{Z}^{s} + \alpha_3 \tilde{Z}^{l}$$

This design enables the model to adaptively regulate the contribution of passenger flow features from different sources to the final prediction.
The fused representation $Z$ is fed into a linear prediction head to generate the normalized passenger flow estimate. The estimate is subsequently denormalized using the mean and standard deviation recorded during the outflow-branch normalization stage, restoring it to the original scale and yielding the final station-level passenger flow prediction $\hat{y}_{t+1}$.
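The adaptive fusion step can be sketched with a softmax over the learnable 3-vector, here shown per feature channel; Section 3.2 reports that this vector is initialized to [0.0, 1.0, 1.2]:

```python
import math

def softmax3(w):
    m = max(w)
    e = [math.exp(x - m) for x in w]
    s = sum(e)
    return [x / s for x in e]

def fuse(q, zs, zl, w):
    """Adaptively fuse the (already layer-normalized) outflow query and
    the short-/long-term attention outputs with softmax weights derived
    from the learnable vector w."""
    a = softmax3(w)
    return [a[0] * qv + a[1] * sv + a[2] * lv
            for qv, sv, lv in zip(q, zs, zl)]

weights = softmax3([0.0, 1.0, 1.2])   # the paper's initialization
fused = fuse([3.0, 0.0], [0.0, 3.0], [0.0, 0.0], [0.0, 0.0, 0.0])
```

With the reported initialization, the long-term attention output starts with the largest weight, but all three coefficients remain trainable and are renormalized by the softmax at every step.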
3 Experimental data and model configuration
3.1 Description and partition of the dataset
The performance of DTIOD is evaluated using real-world AFC data collected from the Hangzhou metro system between January 1 and January 25, 2019. The dataset covers four metro lines (Lines 1, 2, 4, and 9) and includes 81 stations, indexed from 0 to 80. Owing to missing entry and exit records, the passenger flow at Station 54 appears as zero in the dataset.
The raw AFC data are systematically cleaned to mitigate random errors arising during data acquisition. The cleaning procedures include: (i) removing records outside operating hours from 5:30–23:30; (ii) discarding incomplete trips that contain only an entry or only an exit record; (iii) eliminating logically inconsistent records, such as those in which the entry time is later than the exit time; and (iv) excluding trips with durations exceeding four hours. This criterion follows the Hangzhou Metro fare policy, which defines a valid trip as one completed within four hours. After data cleaning, a total of 28,868,674 valid and complete trip records remain.
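The four cleaning rules can be expressed as a single predicate; the record layout below (entry and exit times in minutes since midnight, `None` marking a missing leg) is a simplification for illustration, not the paper's actual schema:

```python
OPEN, CLOSE = 5 * 60 + 30, 23 * 60 + 30  # operating hours 5:30-23:30

def is_valid_trip(entry_min, exit_min, max_minutes=240):
    """Apply the Section 3.1 cleaning rules to one AFC record."""
    if entry_min is None or exit_min is None:        # (ii) incomplete trip
        return False
    if not (OPEN <= entry_min <= CLOSE
            and OPEN <= exit_min <= CLOSE):          # (i) outside hours
        return False
    if entry_min > exit_min:                         # (iii) inconsistent
        return False
    return exit_min - entry_min <= max_minutes       # (iv) 4-hour fare rule

records = [(400, 430),   # valid
           (300, 430),   # entry before opening
           (400, None),  # exit record missing
           (500, 480),   # entry after exit
           (400, 700)]   # duration over 4 hours
valid = [r for r in records if is_valid_trip(*r)]
```

Applying these rules to the raw AFC logs is what yields the 28,868,674 valid trips reported above.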
Valid trip records are aggregated to a time granularity of 15 minutes to construct station-level inflow and outflow time series. With a daily operating duration of 18 hours, the system produces 72 time steps per day. Over the 25-day study period, this yields 1800 consecutive time steps indexed from 0 to 1799. The resulting data form matrices of size $N \times 1800$, where $N = 81$ denotes the number of stations. The dataset is split chronologically into training, validation, and test sets using ratios of 0.72, 0.08, and 0.20, respectively. Accordingly, the index ranges are 0–1295 for training, 1296–1439 for validation, and 1440–1799 for testing, corresponding to the periods from 1–18 January, 19–20 January, and 21–25 January 2019.
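The chronological split can be reproduced with simple index arithmetic; unlike a random split, the test period strictly follows the training period, which matches how the model would be deployed:

```python
def chrono_split(n_steps, ratios=(0.72, 0.08, 0.20)):
    """Chronological train/validation/test index ranges (no shuffling)."""
    n_train = round(n_steps * ratios[0])
    n_val = round(n_steps * ratios[1])
    return (range(0, n_train),
            range(n_train, n_train + n_val),
            range(n_train + n_val, n_steps))

train, val, test = chrono_split(1800)
```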
3.2 Model configuration and evaluation metrics
The model requires four input tensors, for which the temporal window lengths $T_y$, $T_s$, and $T_l$ must be specified. Here, $T_y$ is set to 72, corresponding to one full day of metro operation, to capture the daily periodicity and time-varying patterns of outflow while avoiding an excessively long window that would reduce the effective training sample size.

The values of $T_s$ and $T_l$ are determined according to the temporal influence of historical inflow on future outflow. Analysis of the training data indicates that 96.83% of inflow-to-outflow travel times fall within 60 minutes, suggesting that short-term effects are largely confined to this range. At a 15-minute granularity, this 60-minute window corresponds to four time steps, and thus $T_s = 4$. In contrast, the long-term influence of inflow on outflow exhibits more complex temporal patterns that are not well captured by descriptive statistics. The value of $T_l$ is therefore selected through empirical analysis.
Assume that the maximum temporal span over which historical inflow at station $i$ can influence future outflow at station $j$ is $\delta$. In other words, inflow observed at time step $t$ can affect outflow at most up to time step $t + \delta$. Accordingly, the outflow at time step $t$ is influenced by inflow occurring within the interval $[t - \delta, t]$, where $i$ and $j$ may be identical or distinct.
Let $X^{i}$ and $Y^{j}$ denote the inflow and outflow time series of stations $i$ and $j$, respectively. Removing the first $\delta$ time steps from $Y^{j}$ results in the truncated outflow sequence $Y^{j}_{\delta}$. Likewise, removing the last $\delta$ time steps from $X^{i}$ produces the corresponding truncated inflow sequence $X^{i}_{\delta}$. Varying $\delta$ produces a set of paired inflow and outflow sequences with different temporal spans, enabling a systematic analysis of the temporal range over which inflow influences outflow. To avoid information leakage and ensure an unbiased evaluation, this paired-sequence construction is performed exclusively on the training set, with no access to validation or test data.
To investigate potential long-term relationships between inflow and outflow, $\delta$ is set to cover an entire week of metro operations, taking values from 0 to 503. The Pearson correlation between $X^{i}_{\delta}$ and $Y^{j}_{\delta}$ is computed to quantify the variation in their statistical dependence with respect to $\delta$, as defined in Eq. (25):

$$r_{ij}(\delta) = \frac{\sum_{t}\left(x^{i}_{t} - \bar{x}^{i}\right)\left(y^{j}_{t+\delta} - \bar{y}^{j}\right)}{\sqrt{\sum_{t}\left(x^{i}_{t} - \bar{x}^{i}\right)^{2}}\,\sqrt{\sum_{t}\left(y^{j}_{t+\delta} - \bar{y}^{j}\right)^{2}}} \tag{25}$$

where $\bar{x}^{i}$ and $\bar{y}^{j}$ represent the mean values of the sequences $X^{i}_{\delta}$ and $Y^{j}_{\delta}$, respectively. $r_{ij}(\delta)$ reflects the correlation between historical inflow at station $i$ and future outflow at station $j$.
Several stations are randomly selected to examine the association patterns between inflow and outflow sequences using the Pearson correlation coefficient defined above. Fig 3 illustrates the results for stations 0, 7, 9, and 15. Each subplot depicts four curves, showing how the Pearson correlation between the outflow of the target station and the inflow of four stations varies with the offset k. Although the correlation profiles differ across stations, all exhibit a pronounced periodic structure with a dominant period of approximately 72 time steps. This observation suggests that the long-term influence of inflow on outflow is strongly characterized by daily periodicity. Accordingly, the long-term influence span is set to 72 time steps.
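The truncation-and-correlation procedure above can be sketched in a few lines of numpy. The synthetic periodic series, the 5-step lag, and the function name are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def lagged_pearson(inflow, outflow, max_lag=503):
    """Pearson correlation between the inflow series truncated at the end
    and the outflow series truncated at the start, for each offset k."""
    corrs = []
    for k in range(max_lag + 1):
        x = inflow[:len(inflow) - k]   # drop the last k steps of inflow
        y = outflow[k:]                # drop the first k steps of outflow
        corrs.append(np.corrcoef(x, y)[0, 1])
    return np.array(corrs)

# Synthetic daily-periodic flow (72 steps per day over two weeks), for illustration
t = np.arange(72 * 14)
rng = np.random.default_rng(0)
inflow = np.sin(2 * np.pi * t / 72) + 0.1 * rng.normal(size=t.size)
outflow = np.roll(inflow, 5)           # outflow lags inflow by 5 steps
r = lagged_pearson(inflow, outflow)
print(int(np.argmax(r[:72])))          # strongest correlation at lag 5
```

On such a periodic series the correlation curve also peaks again near multiples of 72, which mirrors the daily-periodicity pattern observed in Fig 3.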
The configuration of model components and training hyperparameters is summarized below. In the outflow branch, the embedding dimension of the DataEmbedding layer is set to 128, with a dropout rate of 0.2. The temporal encoder comprises two stacked TimesBlocks. Each Inception-style convolution module employs three convolution kernels, with the channel dimension set to 64. The number of key periods is fixed at 2. The two inflow branches share the same architecture. Their feature extraction modules consist of a two-layer MLP, where the first layer uses the ReLU activation function and the second layer adopts GELU. The hidden dimension is set to 128, and a dropout rate of 0.2 is applied between the two layers to improve generalization.
The long-term cross-attention module (Attention-L) employs 8 attention heads with a per-head dimension of 16, while the short-term module (Attention-S) uses 4 heads with a per-head dimension of 32. The learnable scalar that controls the scaling of the OD bias is initialized to 1.11. The learnable parameter vector that produces the adaptive fusion weights for the three feature streams via softmax is initialized as [0.0, 1.0, 1.2].
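As a concrete illustration of the softmax-based fusion just described, the following numpy sketch computes stream weights from the initial logits [0.0, 1.0, 1.2]; the mapping of logits to streams and the feature shapes are assumptions for illustration.

```python
import numpy as np

def softmax(v):
    # Numerically stable softmax over a 1-D vector
    e = np.exp(v - np.max(v))
    return e / e.sum()

# Learnable fusion logits, initialized as in the paper: [0.0, 1.0, 1.2];
# the ordering (outflow, long-term, short-term) is assumed here.
fusion_logits = np.array([0.0, 1.0, 1.2])
w = softmax(fusion_logits)

# Hypothetical feature tensors of shape (batch, d_model)
rng = np.random.default_rng(0)
F_out, F_long, F_short = (rng.normal(size=(2, 128)) for _ in range(3))
F_fused = w[0] * F_out + w[1] * F_long + w[2] * F_short
print(np.round(w, 3))  # weights sum to 1; the last stream starts largest
```

Because the logits are learnable, the softmax keeps the three weights positive and normalized while letting training shift emphasis between the streams.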
The model is trained using the AdamW optimizer with a weight decay of 1e-05. The mean squared error (MSE) loss, defined in Eq. (26), is used for training. Gradient clipping with a threshold of 1.0 is applied to prevent exploding gradients. Different learning rates are assigned to different modules. The outflow branch uses an initial learning rate of 0.001. The inflow branches, the cross-attention module, and the learnable OD-bias scalar and fusion weight vector adopt a learning rate scaled by a factor of 0.9 relative to the base learning rate. This slight reduction helps stabilize training and allows the primary outflow branch to guide the optimization process. A ReduceLROnPlateau scheduler is applied during training. If the validation root mean squared error (RMSE), defined in Eq. (27), does not improve for three consecutive epochs, the learning rate is reduced by half, with a minimum value of 1e-06. Training uses a fixed batch size, and early stopping with a patience of 10 epochs is employed to mitigate overfitting.
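The plateau-based halving rule described above can be expressed in plain Python. This is a minimal re-implementation of the scheduling logic (mirroring PyTorch's ReduceLROnPlateau behavior), not the paper's training code.

```python
class PlateauHalver:
    """Halve the learning rate when validation RMSE fails to improve for
    `patience` consecutive epochs, never going below `min_lr`."""

    def __init__(self, lr=1e-3, patience=3, factor=0.5, min_lr=1e-6):
        self.lr, self.patience = lr, patience
        self.factor, self.min_lr = factor, min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_rmse):
        if val_rmse < self.best:          # improvement: reset the counter
            self.best = val_rmse
            self.bad_epochs = 0
        else:                             # stagnation: count toward patience
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr

sched = PlateauHalver()
for rmse in [20.0, 18.0, 18.5, 18.4, 18.3]:  # three epochs without improvement
    lr = sched.step(rmse)
print(lr)  # halved once: 0.0005
```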
Predictive performance is evaluated using RMSE, mean absolute error (MAE), and weighted mean absolute percentage error (WMAPE), which are defined in Equations (27)–(29).
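For reference, the three metrics can be implemented as follows; the WMAPE form shown (total absolute error normalized by total actual flow) is the common definition and is assumed to match Eq. (29).

```python
import numpy as np

def rmse(y, yhat):
    # Root mean squared error
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def mae(y, yhat):
    # Mean absolute error
    return float(np.mean(np.abs(y - yhat)))

def wmape(y, yhat):
    # Weighted MAPE: total absolute error over total actual flow
    # (assumed form; the paper's exact Eq. (29) may differ slightly)
    return float(np.sum(np.abs(y - yhat)) / np.sum(np.abs(y)))

y = np.array([10.0, 20.0, 30.0])
yhat = np.array([12.0, 18.0, 33.0])
print(rmse(y, yhat), mae(y, yhat), wmape(y, yhat))
```

WMAPE weights each station-time error by its actual flow, so it is less distorted by near-zero off-peak counts than ordinary MAPE.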
3.3 Baseline models
To objectively evaluate the predictive performance of DTIOD, we establish a multilevel comparison framework comprising baseline models from four methodological categories: statistical methods, traditional machine learning models, deep temporal models, and spatiotemporal fusion approaches. These baselines are selected based on their structural relevance to DTIOD and their compatibility with the Hangzhou metro dataset. Detailed model configurations are provided below.
- (1) Statistical and traditional machine learning models.
HA (historical average) uses the historical mean passenger flow of each station as the prediction and requires neither model training nor hyperparameter tuning, serving as a basic statistical baseline. RF (random forest) [42] and SVR (support vector regression) [43] are selected as representative machine learning baselines, which model nonlinear input–output relationships through ensemble learning and kernel-based methods, respectively. Both methods remain stable when the sample size is small or the feature dimension is limited. In the experiments, each model uses an input sequence length of 4. RF is implemented with 100 trees, and SVR employs a radial basis function kernel. Comparing DTIOD with HA, RF, and SVR quantifies its performance gains over conventional non-deep-learning methods.
- (2) Deep learning temporal models.
Metro passenger flow typically exhibits pronounced peak periods together with stable daily temporal patterns. Accurately capturing short-term fluctuations and long-term periodicity is therefore essential for reliable prediction. LSTM (long short-term memory) [44] and GRU (gated recurrent unit) [45] are adopted as representative deep learning temporal baselines because of their ability to model multi-scale temporal dependencies through gating mechanisms and their extensive use in transportation forecasting studies. In the experiments, both models take a historical sequence of length 36 as input and use a hidden dimension of 128. A fully connected layer with 256 units and a dropout rate of 0.2 is applied before the output layer. TimesNet is a recent temporal model that has demonstrated strong predictive performance and also serves as the temporal encoder in the outflow branch of DTIOD. Using TimesNet as a baseline enables a direct assessment of the performance gains brought by the proposed dual-temporal inflow–outflow dependency modeling. For a fair comparison, the hyperparameters of TimesNet are kept identical to those used in the outflow branch of DTIOD. Comparisons with LSTM, GRU, and TimesNet therefore provide insight into the advantages of DTIOD over purely temporal modeling approaches.
- (3) Spatiotemporal fusion models.
Metro networks inherently exhibit graph structure. As a result, spatiotemporal fusion models based on graph convolution have been widely applied in passenger flow prediction studies. However, the Hangzhou Metro dataset used in this study contains missing passenger flow records at a station. Such data incompleteness may disrupt feature propagation in models that rely on predefined graph structures. To examine the impact of this issue, several representative graph-based models are selected as comparison baselines, including STGCN [46] (spatio-temporal graph convolutional networks), PVGCN [15] (physical-virtual collaboration graph network), PMR-GCN [29] (regularized spatial-temporal graph convolutional networks for metro passenger flow prediction), and MR-STN [33] (spatio-temporal network framework based on multi-relational). These models enable an evaluation of graph structured modeling approaches under the data conditions of the Hangzhou Metro dataset. Among these models, STGCN is a classical fusion framework that jointly models spatial and temporal dependencies through graph and temporal convolutions. PVGCN integrates physical and virtual network topologies to characterize passenger flow dynamics from multiple graph perspectives. PMR-GCN combines personalized graph convolution with multi-head self-attention to model complex spatiotemporal dependencies. MR-STN explicitly models the coupling between inflow and outflow and extracts local and global features through an enhanced STFL (spatio-temporal feature learner) module.
Comparisons with STGCN, MR-STN, PVGCN, and PMR-GCN enable a systematic evaluation of the predictive performance of DTIOD as a graph-free model. In particular, comparisons with MR-STN and PVGCN provide insight into different strategies for utilizing OD information. MR-STN employs time-step-level OD data to characterize the relationship between inflow and outflow, whereas PVGCN relies on globally aggregated static OD data to construct a station correlation graph. Considering these approaches together helps clarify how different forms of OD information influence model performance and facilitates an assessment of the relative effectiveness of DTIOD within this research line.
In addition, the graph-free model SDT-GRU [35] is included as an additional baseline. SDT-GRU adopts a sequence-to-sequence architecture built upon stacked DT-GRU layers, where each DT-GRU integrates a dual-branch Transformer module into the GRU to capture spatiotemporal dependencies without relying on predefined graphs. Comparing DTIOD with SDT-GRU, which follows a similar graph-free modeling paradigm, helps reveal performance differences within this framework. All spatiotemporal fusion models considered in this study adopt the parameter settings recommended in their original publications.
All models are configured for single-step passenger flow prediction. Deep learning models are trained with an early stopping strategy (patience = 10) to mitigate overfitting. All implementations are based on the PyTorch framework. Each model is evaluated over five independent runs, and the reported results are averaged across runs.
4 Case analysis
This section evaluates the DTIOD model along four dimensions: overall performance, training efficiency, prediction accuracy, and ablation analysis.
4.1 Overall performance analysis
Table 1 summarizes the overall performance of DTIOD and all baselines on the full test set. For each deep learning model, it also reports the average training time per epoch and the total training time.
In general, predictive accuracy is expected to follow a hierarchical pattern, with spatiotemporal fusion models outperforming purely temporal models, deep learning approaches surpassing traditional machine learning methods, and the latter exceeding statistical baselines. The results in Table 1, however, deviate from this pattern. While the simple statistical baseline (HA) performs worst, traditional machine learning models (RF and SVR) achieve higher accuracy than several deep learning approaches. In addition, purely temporal models, including LSTM, GRU, and TimesNet, outperform most spatiotemporal fusion models across the evaluated metrics.
A joint analysis of model architectures and experimental results reveals a consistent trend that greater reliance on graph-based spatial modeling corresponds to degraded predictive performance. Although STGCN, MR-STN, PVGCN, and PMR-GCN achieve strong results on several public metro datasets, their performance deteriorates markedly on the Hangzhou Metro dataset. This decline stems from the complete absence of entry and exit records at Station 54, which disrupts adjacency-based feature aggregation in graph convolutional layers. Because graph convolution relies on coherent feature propagation across connected nodes, missing node features hinder spatial information diffusion. This structural deficiency introduces noise into the learned representations, weakens spatial feature extraction, and ultimately reduces forecasting accuracy. These results suggest that spatiotemporal fusion models heavily dependent on predefined graph structures exhibit limited robustness when node features are incomplete or data quality is inconsistent.
Among all baseline methods, SDT-GRU achieves the highest predictive accuracy. Its strong performance is primarily attributed to its ability to capture global spatial correlations and temporal dependencies without relying on a predefined station correlation graph, thereby alleviating the adverse effects of missing passenger flow records. Following the same graph-free modeling paradigm, DTIOD further outperforms SDT-GRU across all evaluation metrics, achieving average reductions of 10.75%, 11.60%, and 6.84% in RMSE, MAE, and WMAPE, respectively. Beyond predictive accuracy, DTIOD also exhibits clear advantages in training time. As indicated by the last two columns of Table 1, its average iteration time per epoch is only marginally higher than that of purely temporal models, while its total training time is substantially lower than that of other spatiotemporal fusion models and even shorter than that of the GRU baseline. Although DTIOD performs well in terms of accuracy and training time, a more detailed examination of its training efficiency, complexity and convergence behavior remains worthwhile.
4.2 Comparison of model complexity and training efficiency
Considering the structural characteristics and predictive performance of the baseline models, five representative approaches are selected for comparison in terms of model complexity and training efficiency: TimesNet, SDT-GRU, STGCN, MR-STN, and PVGCN. Model complexity is evaluated using the number of parameters and FLOPs (floating-point operations). These models represent different modeling paradigms and information utilization strategies. TimesNet and SDT-GRU belong to graph-free architectures, whereas STGCN, MR-STN, and PVGCN are graph-based models. Fig 4 presents the comparison between DTIOD and the selected baselines, illustrating the evolution of validation errors with respect to training epochs and cumulative training time. The corresponding statistics for model parameters and FLOPs, computed with a batch size of 1, are reported in Table 2.
Graph-based models generally exhibit lower training efficiency, although substantial differences exist among them. Examination of Tables 1 and 2 together with Fig 4 reveals several patterns. STGCN adopts a relatively simple graph architecture and therefore has the lowest computational complexity among the graph-based models. Its training time is comparable to that of the graph-free models TimesNet and DTIOD. However, the error curve of STGCN gradually flattens during the later stages of training, indicating limited potential for further improvement. Among the graph-based models, MR-STN and PVGCN employ multi-graph modeling structures and consequently have much higher computational complexity than STGCN. In particular, PVGCN requires 68.9589 seconds for a single training epoch, which is approximately 54 times longer than that of STGCN. Its parameter count and FLOPs are also the largest among all compared models. By contrast, MR-STN has moderate computational complexity, but its error decreases relatively slowly during training, resulting in low training efficiency. The model ultimately requires 1261.9984 seconds to complete training and produces the largest prediction error among all comparison methods.
Graph-free models generally demonstrate advantages in training efficiency, although differences remain among them. TimesNet, as a purely temporal model, requires a total training time of 50.1019 seconds. Its computational complexity is only higher than that of STGCN and remains relatively low among all compared models. However, its predictive accuracy is limited and shows signs of overfitting. TimesNet achieves lower validation error than SDT-GRU but performs worse on the test set, indicating weaker generalization ability. By contrast, although SDT-GRU also follows a graph-free modeling paradigm, its computational complexity is second only to that of PVGCN. The model adopts a sequence-to-sequence architecture, which restricts parallel computation. As a result, its training time per epoch and total training time are substantially higher than those of other graph-free models and even exceed those of some graph-based models such as STGCN. These results indicate that the use of graph structures is not the sole factor determining model training efficiency.
As a graph-free architecture, DTIOD achieves a favorable trade-off between computational cost and predictive accuracy. The total training time is 66.3849s, which is only 24.53% higher than that of TimesNet, while the predictive accuracy surpasses all baseline models. The computational complexity remains low and is only slightly higher than that of TimesNet. During training, the error curve shows moderate fluctuations in the intermediate stage, after which the RMSE decreases rapidly and converges to a stable level, indicating efficient optimization. The superior performance of DTIOD stems from the coordinated design of its architecture and information utilization strategy. First, DTIOD inherits the multiscale temporal modeling capability of TimesNet and further introduces inflow–outflow dependency modeling, which significantly enhances predictive accuracy. The substantial performance gap between DTIOD and TimesNet observed in the experiments confirms the importance of explicitly modeling inflow and outflow dependencies. Second, the overall architecture remains simple because the model does not rely on graph structures or spatial modules, which helps maintain low computational complexity. Third, DTIOD incorporates OD information through a sample-level OD matrix used as an attention bias, which enables the integration of OD dynamics with limited computational overhead. In contrast, MR-STN models the OD matrix at the time-step level and therefore introduces substantially higher computational complexity. PVGCN instead constructs station correlation graphs based on static OD information, which limits its ability to capture dynamic passenger flow patterns.
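The OD-bias mechanism discussed above amounts to adding a scaled OD matrix to the attention logits before the softmax. The sketch below illustrates this for a single head; the shapes, the log-transformed OD counts, and the function name are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def biased_attention(Q, K, V, od_bias, lam=1.11):
    """Scaled dot-product attention with an additive OD bias; `lam` plays
    the role of the learnable scaling scalar (initialized to 1.11)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + lam * od_bias  # OD matrix shifts attention
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n, d = 4, 8                        # 4 stations, hypothetical feature dim 8
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
od_bias = np.log1p(rng.integers(0, 50, size=(n, n)))  # illustrative OD counts
out = biased_attention(Q, K, V, od_bias)
print(out.shape)  # (4, 8)
```

Because the bias enters additively before the softmax, station pairs with larger OD flows receive systematically more attention, while the query-key similarity term still adapts per sample.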
4.3 Prediction performance analysis
This section evaluates DTIOD’s prediction performance across temporal and station-level dimensions, relative to five baseline models. Fig 5 presents the prediction errors at each time step on 25 January 2019. All models exhibit a bimodal error distribution over time. STGCN, MR-STN, and PVGCN show the largest error fluctuations, whereas SDT-GRU and TimesNet demonstrate comparatively greater stability. DTIOD yields the smoothest error trajectory and achieves lower errors than the baseline models across most time steps.
Table 3 further reports the prediction errors of each model during peak and off-peak periods, measured by RMSE. The results show that all models produce larger errors during peak periods than during off-peak periods. This pattern mainly arises because passenger flows during peak hours are both larger in magnitude and more volatile. The resulting increase in prediction error therefore reflects higher demand intensity and greater uncertainty in passenger flows rather than a systematic bias of model parameters toward specific time periods. In terms of individual model performance, DTIOD achieves the lowest prediction error during both the off-peak period and the morning peak period, outperforming all comparison models overall. During the evening peak period, its error is slightly higher than that of TimesNet. Notably, the prediction error of TimesNet during the morning peak period is substantially higher and is lower only than that of STGCN. Considering results across all time periods, DTIOD maintains consistently high prediction accuracy under different temporal scenarios, demonstrating strong robustness and generalization capability.
At the station level, DTIOD’s performance is assessed at four representative stations. Station 33 is a terminal station on Line 9, characterized by typical peak-hour commuting patterns. Station 46 serves as an interchange between Lines 2 and 4 and exhibits more complex passenger-flow dynamics. Station 15, the busiest multimodal hub in the network, handles a diverse range of passenger flows. Finally, Station 55, located adjacent to Station 54 where data are missing, provides a test of DTIOD’s robustness in handling local data incompleteness.
Fig 6 compares the predicted and actual outflow at the four representative stations. As shown in Figs 6(c) and 6(d), DTIOD achieves high accuracy at stations such as 46 and 55, where passenger flow patterns are relatively regular, with the predicted curves closely matching the ground-truth trajectories. The results for Station 55 further demonstrate that missing data at a neighboring station do not significantly impair model performance. In contrast, predictions for Stations 15 and 33, shown in Figs 6(a) and 6(b), are less accurate. These stations exhibit frequent and pronounced fluctuations in passenger flow, resulting in more complex temporal patterns that increase the difficulty of modeling rapidly changing behaviors. Nevertheless, DTIOD successfully captures the overall trends and responds effectively to sudden changes, demonstrating its robustness and adaptability in complex operating environments.
4.4 Ablation study
As described in Section 2.2, DTIOD comprises two core components. TimesNet is responsible for extracting temporal features from outflow, while a cross-attention mechanism models the multi-scale temporal influence of inflow on outflow. Ablating this cross-attention mechanism reduces DTIOD to the TimesNet baseline; their comparative predictive performance has been thoroughly analyzed in earlier sections and is not revisited here. To systematically evaluate the contribution of the remaining components in DTIOD, we conduct a series of ablation experiments by modifying the model architecture and feature fusion strategy. The corresponding results are summarized in Table 4.
The ablation variants are defined as follows.
- (1) DTIOD-Only-Attention-L retains only the long-scale attention output for prediction, with the short-scale attention output and the query representation removed.
- (2) DTIOD-Only-Attention-S retains only the short-scale attention output, excluding the long-scale attention output and the query representation from the prediction process.
- (3) DTIOD-Only-Attention removes the query representation and directly fuses the long-scale and short-scale attention outputs for prediction.
- (4) DTIOD-No-Attention-L removes the long-scale attention module and fuses the short-scale feature with the query representation for prediction.
- (5) DTIOD-No-Attention-S removes the short-scale attention module and combines the long-scale feature with the query representation before prediction.
For experiments (3) to (5), the two retained feature streams are fused through a sigmoid-gated weighted sum:

F = σ(α) · F₁ + (1 − σ(α)) · F₂

where σ denotes the sigmoid function and F₁ and F₂ are the two retained feature streams. Here, α is a learnable parameter whose gate value σ(α) is constrained to lie between 0 and 1 by the sigmoid transformation. The initial value of α is set to 0.5.
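The sigmoid-gated fusion used in variants (3)–(5) reduces to a single-parameter convex combination. In this minimal sketch the scalar features are purely illustrative stand-ins for the two retained streams.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

alpha = 0.5                    # learnable gate parameter, initialized to 0.5
g = sigmoid(alpha)             # fusion weight constrained to (0, 1)
feat_a, feat_b = 3.0, 1.0      # hypothetical features of the two streams
fused = g * feat_a + (1.0 - g) * feat_b
print(round(g, 3))  # sigmoid(0.5) ≈ 0.622
```

With a single gate the two weights always sum to one, so the ablation variants trade off the two remaining streams rather than rescaling them independently.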
The ablation results in Table 4 confirm that removing any core module degrades performance, validating the necessity of each component in DTIOD. Variants DTIOD-Only-Attention-L and DTIOD-Only-Attention-S produce comparable errors, which are substantially higher than those of other configurations. This result indicates that a single attention branch cannot sufficiently capture the characteristics of outflow. Although DTIOD-Only-Attention achieves slightly improved accuracy, its performance remains limited, highlighting the critical role of the Query component in providing a temporal representation of the target prediction step. In comparison, DTIOD-No-Attention-L and DTIOD-No-Attention-S outperform the single-branch variants, which demonstrates that retaining the query representation while removing only one attention branch is more effective than relying on a single inflow attention branch alone. Notably, DTIOD-No-Attention-S exhibits higher errors than DTIOD-No-Attention-L, suggesting that the short-term inflow attention branch contributes more to accurate forecasting.
Overall, the ablation studies provide systematic evidence for the effectiveness of the Attention-L, Attention-S, and Query components in DTIOD. The results indicate that combining multi-scale temporal dependency modeling between inflow and outflow with explicit temporal representations at the target prediction step is crucial for improving outflow forecasting accuracy. Moreover, a comparison with the results in Table 1 shows that although removing individual components leads to some performance degradation, all DTIOD variants consistently outperform every baseline model evaluated in this study. This consistent advantage demonstrates the structural robustness of the proposed framework.
4.5 Parameter sensitivity analysis
To evaluate the sensitivity of DTIOD to hyperparameter choices, twelve key parameters are examined, each with six candidate values. The model dimension parameters comprise the embedding dimension, the convolution channel dimension in the outflow branch, and the hidden dimensions of the long-term and short-term inflow branches; these are tested over {16, 32, 64, 128, 256, 512}. Architectural parameters include the number of convolution kernels in the outflow Inception block, the number of TimesBlocks in the outflow branch, and the number of key periods in the outflow branch; these are evaluated over {1, 2, 3, 4, 5, 6}. The numbers of attention heads in Attention-L and Attention-S are selected from {1, 2, 4, 8, 16, 32}. The learning-rate scaling factor is tested over {0.5, 0.7, 0.9, 1.0, 1.2, 1.5}, while the OD-bias scalar is evaluated over {0.89, 0.90, 1.00, 1.10, 1.11, 1.20}. The learnable fusion weight vector is initialized with six configurations: [0, 0, 0], [1.2, 1, 0], [1, 1.2, 0], [1.2, 0, 1], [0, 1.2, 1], and [0, 1, 1.2].
The candidate values of the OD-bias scalar are chosen to lie close to 1, as this parameter governs the initial OD bias scaling strength. If it is excessively large, the OD bias dominates the Attention-S computation and weakens the attention structure derived from query-key similarity. If it is too small, the bias contributes little to attention weight allocation. Therefore, this parameter requires finer calibration than the others.
When evaluating an individual parameter, all others are fixed at the default settings defined in Section 3.2. The experimental results are summarized in Table 5. The second column reports the RMSE on the test set under the six candidate settings for each parameter, while the third column presents the corresponding relative change of RMSE. The relative change (RC) quantifies the overall variation in prediction error and is defined as

RC = (RMSE_max − RMSE_min) / RMSE_mean

where RMSE_max denotes the maximum RMSE, RMSE_min the minimum RMSE, and RMSE_mean the mean RMSE across all candidate settings.
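The RC metric is straightforward to compute; the RMSE values below are invented for illustration only.

```python
import numpy as np

def relative_change(rmses):
    """RC = (max - min) / mean over the candidate settings."""
    r = np.asarray(rmses, dtype=float)
    return float((r.max() - r.min()) / r.mean())

rc = relative_change([20.0, 21.0, 22.0, 23.0, 24.0, 25.0])
print(round(rc, 4))  # (25 - 20) / 22.5 ≈ 0.2222
```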
As shown in Table 5, DTIOD maintains stable predictive performance across the examined parameter ranges. Even under less favorable settings, it consistently outperforms all baseline models, demonstrating strong robustness. Further analysis shows that the model is more sensitive to several of the dimension-related and architectural parameters, as reflected by the relatively large changes in RMSE. Increasing these parameters enhances the feature representation capacity of the model and reduces prediction error. However, beyond an appropriate range, the resulting increase in model complexity raises the risk of overfitting, which may cause the prediction accuracy to plateau or decline. In contrast, the variations in RMSE associated with the learning-rate scaling factor, the OD-bias scalar, and the fusion weight initialization remain consistently small, indicating that the model is less sensitive to these training hyperparameters. Therefore, the predictive performance of DTIOD does not critically depend on their exact values.
5 Conclusion
This paper presents DTIOD, a model for short-term metro outflow prediction that integrates long-term and short-term inflow dependencies. Unlike mainstream spatiotemporal fusion models that rely on explicit spatial modules, DTIOD employs a cross‑branch attention mechanism to implicitly learn inter‑station interactions, thereby enhancing robustness under limited spatial priors or incomplete data. In addition, sample-level OD matrices are incorporated as attention biases, providing a temporal granularity intermediate between global static and time-step-specific representations. This design enables the model to capture dynamic flow relationships while maintaining controlled computational cost.
Experiments on Hangzhou Metro AFC data demonstrate that DTIOD consistently outperforms mainstream spatiotemporal fusion models in prediction accuracy while retaining high training efficiency. Notably, these performance gains are achieved without predefined graph structures, external temporal labels, or auxiliary data, highlighting the model’s data efficiency and practical robustness. These advantages indicate strong potential for real-time short-term forecasting in urban rail systems, although performance at stations with highly volatile peak-hour flows remains a topic for further investigation.
Future work can be pursued in several directions. First, incorporating meteorological variables, temporal contextual information, and other heterogeneous data sources may further improve performance in environments where external conditions strongly influence travel demand. Second, extending the inflow–outflow dependency framework to related transportation forecasting tasks would help clarify its generality and applicability. Finally, integrating information from spatially adjacent alternative travel modes, including bike-sharing usage, public transport availability, and bus ridership, may further enhance short-term metro passenger flow prediction.
Supporting information
S2 Fig. Dual-temporal cross-attention module for inflow–outflow dependency modeling.
https://doi.org/10.1371/journal.pone.0347131.s002
(JPG)
S3 Fig. Pearson correlation between inflow and outflow sequences at four stations.
https://doi.org/10.1371/journal.pone.0347131.s003
(JPG)
S4 Fig. Training dynamics of DTIOD and baselines.
https://doi.org/10.1371/journal.pone.0347131.s004
(JPG)
S5 Fig. Prediction error comparison between DTIOD and baselines.
https://doi.org/10.1371/journal.pone.0347131.s005
(JPG)
S6 Fig. Predicted vs. actual outflow at four representative stations.
https://doi.org/10.1371/journal.pone.0347131.s006
(JPG)
References
- 1. Cai L, Zhang Z, Yang J, Yu Y, Zhou T, Qin J. A noise-immune Kalman filter for short-term traffic flow forecasting. Physica A: Statistical Mechanics and its Applications. 2019;536:122601.
- 2. Wen K, Zhao G, He B, Ma J, Zhang H. A decomposition-based forecasting method with transfer learning for railway short-term passenger flow in holidays. Expert Systems with Applications. 2022;189:116102.
- 3. Li P, Wu W, Pei X. A separate modelling approach for short-term bus passenger flow prediction based on behavioural patterns: A hybrid decision tree method. Physica A: Statistical Mechanics and its Applications. 2023;616:128567.
- 4. Roos J, Bonnevay S, Gavin G. Short-term urban rail passenger flow forecasting: A dynamic Bayesian network approach. In: 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), 2016. 1034–9.
- 5. Liu S, Yao E. Holiday Passenger Flow Forecasting Based on the Modified Least-Square Support Vector Machine for the Metro System. J Transp Eng, Part A: Systems. 2017;143(2).
- 6. Delamou M, Bazzi A, Chafii M, Amhoud EM. Deep Learning-based Estimation for Multitarget Radar Detection. In: 2023 IEEE 97th Vehicular Technology Conference (VTC2023-Spring), 2023. 1–5.
- 7. Naoumi S, Bazzi A, Bomfin R, Chafii M. Deep learning-enabled angle estimation in bistatic ISAC systems. In: 2023 IEEE Globecom Workshops (GC Wkshps), 2023. 854–9.
- 8. Njima W, Bazzi A, Chafii M. DNN-Based Indoor Localization Under Limited Dataset Using GANs and Semi-Supervised Learning. IEEE Access. 2022;10:69896–909.
- 9. Liu Y, Liu Z, Jia R. DeepPF: A deep learning based architecture for metro passenger flow prediction. Transportation Research Part C: Emerging Technologies. 2019;101:18–34.
- 10. Li H, Fu W, Zhang H, Liu W, Sun S, Zhang T. Spatio–temporal graph hierarchical learning framework for metro passenger flow prediction across stations and lines. Knowledge-Based Systems. 2025;311:113132.
- 11. Zhang J, Chen F, Cui Z, Guo Y, Zhu Y. Deep Learning Architecture for Short-Term Passenger Flow Forecasting in Urban Rail Transit. IEEE Trans Intell Transport Syst. 2021;22(11):7004–14.
- 12. Huang H, Mao J, Lu W, Hu G, Liu L. DEASeq2Seq: An attention based sequence to sequence model for short-term metro passenger flow prediction within decomposition-ensemble strategy. Transportation Research Part C: Emerging Technologies. 2023;146:103965.
- 13. Huang H, Mao J, Kang L, Lu W, Zhang S, Liu L. A dynamic graph deep learning model with multivariate empirical mode decomposition for network‐wide metro passenger flow prediction. Computer-Aided Civil and Infrastructure Engineering. 2024;39(17):2596–618.
- 14. Fang H, Chen C-H, Hwang F-J, Chang C-C, Chang C-C. Metro Station functional clustering and dual-view recurrent graph convolutional network for metro passenger flow prediction. Expert Systems with Applications. 2024;247:122550.
- 15. Liu L, Chen J, Wu H, Zhen J, Li G, Lin L. Physical-Virtual Collaboration Modeling for Intra- and Inter-Station Metro Ridership Prediction. IEEE Transactions on Intelligent Transportation Systems. 2022;23:3377–91.
- 16. Zeng J, Tang J. Combining knowledge graph into metro passenger flow prediction: A split-attention relational graph convolutional network. Expert Systems with Applications. 2023;213:118790.
- 17. Hu S, Chen J, Zhang W, Liu G, Chang X. Graph transformer embedded deep learning for short-term passenger flow prediction in urban rail transit systems: A multi-gate mixture-of-experts model. Information Sciences. 2024;679:121095.
- 18. Zhan S, Cai Y, Xiu C, Zuo D, Wang D, Chun Wong S. Parallel framework of a multi-graph convolutional network and gated recurrent unit for spatial–temporal metro passenger flow prediction. Expert Systems with Applications. 2024;251:123982.
- 19. Sun J, Ye X, Yan X, Wang T, Chen J. Multi-step peak passenger flow prediction of urban rail transit based on multi-station spatio-temporal feature fusion model. Systems. 2025;13(2):96.
- 20. Wang J, Zhang Y, Wei Y, Hu Y, Piao X, Yin B. Metro Passenger Flow Prediction via Dynamic Hypergraph Convolution Networks. IEEE Trans Intell Transport Syst. 2021;22(12):7891–903.
- 21. Diao C, Zhang D, Liang W, Jiang M, Li K. A Novel Attention-Based Dynamic Multi-Graph Spatial-Temporal Graph Neural Network Model for Traffic Prediction. IEEE Trans Emerg Top Comput Intell. 2025;9(2):1910–23.
- 22. Li R, Zhao L, Tang J, Tang S, Hao Z. Multi-dimensional time-dependent dynamic graph neural network for metro passenger flow prediction. Appl Intell. 2025;55:442.
- 23. Zhang J, Chen F, Guo Y, Li X. Multi-graph convolutional network for short-term passenger flow forecasting in urban rail transit. IET Intell Transp Syst. 2020;14(10):1210–7.
- 24. Zhao C, Li X, Shao Z, Yang H, Wang F. Multi-featured spatial-temporal and dynamic multi-graph convolutional network for metro passenger flow prediction. Connection Science. 2022;34(1):1252–72.
- 25. Ma C, Zhao M. Urban rail transit passenger flow prediction using large language model under multi-source spatiotemporal data fusion. Physica A: Statistical Mechanics and its Applications. 2025;675:130823.
- 26. Ma C, Zhang B, Li S, Lu Y. Urban rail transit passenger flow prediction with ResCNN-GRU based on self-attention mechanism. Physica A: Statistical Mechanics and its Applications. 2024;638:129619.
- 27. Zong X, Guo J, Liu F, Yu F. TSTA-GCN: trend spatio-temporal traffic flow prediction using adaptive graph convolution network. Sci Rep. 2025;15(1):13449. pmid:40251289
- 28. Ou J, Sun J, Zhu Y, Jin H, Liu Y, Zhang F, et al. STP-TrellisNets+: Spatial-Temporal Parallel TrellisNets for Multi-Step Metro Station Passenger Flow Prediction. IEEE Trans Knowl Data Eng. 2022:1–14.
- 29. Gao C, Liu H, Huang J, Wang Z, Li X, Li X. Regularized Spatial–Temporal Graph Convolutional Networks for Metro Passenger Flow Prediction. IEEE Trans Intell Transport Syst. 2024;25(9):11241–55.
- 30. Zhang Y, Chen Y, Wang Z, Xin D. TMFO-AGGRU: A Graph Convolutional Gated Recurrent Network for Metro Passenger Flow Forecasting. IEEE Trans Intell Transport Syst. 2024;25(3):2893–907.
- 31. Wu J, Li X, He D, Li Q, Xiang W. Learning spatial-temporal dynamics and interactivity for short-term passenger flow prediction in urban rail transit. Appl Intell. 2023;53:19785–806.
- 32. He Y, Li L, Zhu X, Tsui KL. Multi-Graph Convolutional-Recurrent Neural Network (MGC-RNN) for Short-Term Forecasting of Transit Passenger Flow. IEEE Trans Intell Transport Syst. 2022;23(10):18155–74.
- 33. Zhao Y, Lin Y, Zhang Y, Wen H, Liu Y, Wu H. Traffic inflow and outflow forecasting by modeling intra- and inter-relationship between flows. IEEE Trans Intell Transport Syst. 2022;23(11):20202–16.
- 34. Wang T, Song J, Zhang J, Tian J, Wu J, Zheng J. Short-term metro passenger flow prediction based on hybrid spatiotemporal extraction and multi-feature fusion. Tunnelling and Underground Space Technology. 2025;159:106491.
- 35. Yang Q, Xu X, Wang Z, Yu J, Hu X. Are graphs and GCNs necessary for short-term metro ridership forecasting? Expert Systems with Applications. 2024;254:124431.
- 36. Gu C, Tan Y, Yin X, Li X, Yang Y, Lv Y. Enhanced Air Pollution Prediction via Adam-Optimized Multi-Head Attention and Hybrid Deep Learning. ICCK Transactions on Intelligent Systematics. 2026;3(1):11–20.
- 37. Haider AU, Gazis A, Zahoor F. SemanticBlur: Semantic-Aware Attention Network with Multi-Scale Feature Refinement for Defocus Blur Detection. ICCK Transactions on Intelligent Systematics. 2026;3(1):21–31.
- 38. Hassan MZ, Gazis A, Khan A, Ghazanfar Z. Learning Cross-Modal Collaboration via Pyramid Attention for RGB Thermal Sensing in Saliency Detection. ICCK Transactions on Sensing, Communication, and Control. 2026;3(1):1–14.
- 39. Khan A, Shah HA. Context Refinement with Multi-Attention Fusion for Saliency Segmentation Using Depth-Aware RGBD Sensing. ICCK Transactions on Sensing, Communication, and Control. 2026;3(1):27–38.
- 40. Wu H, Hu T, Liu Y, Zhou H, Wang J, Long M. TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis. arXiv preprint arXiv:2210.02186; 2023.
- 41. Singh J, Banerjee R. A Study on Single and Multi-layer Perceptron Neural Network. In: 2019 3rd International Conference on Computing Methodologies and Communication (ICCMC). Erode, India: IEEE; 2019. p. 35–40.
- 42. Yu B, Wang H, Shan W, Yao B. Prediction of Bus Travel Time Using Random Forests Based on Near Neighbors. Computer-Aided Civil and Infrastructure Engineering. 2018;33(4):333–50.
- 43. Li Y, Wang X, Sun S, Ma X, Lu G. Forecasting short-term subway passenger flow under special events scenarios using multiscale radial basis function networks. Transportation Research Part C: Emerging Technologies. 2017;77:306–28.
- 44. Yang X, Xue Q, Yang X, Yin H, Qu Y, Li X, et al. A novel prediction model for the inbound passenger flow of urban rail transit. Information Sciences. 2021;566:347–63.
- 45. Shu W, Cai K, Xiong NN. A Short-Term Traffic Flow Prediction Model Based on an Improved Gate Recurrent Unit Neural Network. IEEE Trans Intell Transport Syst. 2022;23(9):16654–65.
- 46. Yu B, Yin H, Zhu Z. Spatio-Temporal Graph Convolutional Networks: A Deep Learning Framework for Traffic Forecasting. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence; 2018. p. 3634–40.