
Feature-enhanced iTransformer: A two-stage framework for high-accuracy long-horizon traffic flow forecasting

  • Yonghui Duan,

    Roles Conceptualization, Formal analysis, Resources, Writing – review & editing

    Affiliation Department of Civil Engineering, Henan University of Technology, Zhengzhou, Henan, China

  • Yucong Zhang ,

    Roles Investigation, Methodology, Software, Validation, Visualization, Writing – original draft

    3080318694@qq.com

    Affiliation Department of Civil Engineering, Henan University of Technology, Zhengzhou, Henan, China

  • Xiang Wang,

    Roles Conceptualization, Funding acquisition, Resources, Writing – review & editing

    Affiliation Department of Civil Engineering, Zhengzhou University of Aeronautics, Zhengzhou, Henan, China

  • Yuan Xue,

    Roles Data curation, Funding acquisition, Methodology

    Affiliation Department of Civil Engineering, Zhengzhou University of Aeronautics, Zhengzhou, Henan, China

  • Zirong Wang,

    Roles Data curation, Supervision

    Affiliation Department of Civil Engineering, Zhengzhou University of Aeronautics, Zhengzhou, Henan, China

  • Di Wu

    Roles Funding acquisition, Supervision

    Affiliation Department of Civil Engineering, Zhengzhou University of Aeronautics, Zhengzhou, Henan, China

Abstract

Accurate and reliable long-horizon traffic flow prediction is a cornerstone of modern Intelligent Transportation Systems (ITS), yet it remains challenging due to the complex, non-linear, and dynamic spatio-temporal dependencies inherent in traffic data. While recent Transformer-based models have shown promise, they are typically end-to-end systems that couple feature extraction and sequence prediction, which can limit their ability to fully leverage multi-faceted domain information. To address this, we propose a two-stage framework, the Feature-Enhanced iTransformer (FE-iTransformer), founded on an extract-and-enhance philosophy. The framework first employs a comprehensive Feature Enhancement Module (FEM) to distill a global context vector from spatio-temporal dynamics, periodic patterns, and temporal context—without relying on a predefined graph structure. Subsequently, an innovative per-step feature enhancement mechanism uses this global vector to enrich the original input sequence, yielding an information-rich representation that is then processed by a strong iTransformer backbone for final prediction. The effectiveness of FE-iTransformer is validated through extensive experiments: ablation studies on two classic datasets (Freeway and Urban) provide compelling evidence for the efficacy of the two-stage design, demonstrating that introducing FEM significantly improves the pure iTransformer backbone; supplementary experiments on the large-scale PEMS08 benchmark further confirm scalability and long-horizon performance, reducing Mean Absolute Error (MAE) by 19.1% over the vanilla backbone in the 120-minute forecasting task. Importantly, this study targets no-graph/weak-graph settings and does not aim to surpass graph-prior models; rather, it offers a deployment-ready, graph-free alternative when the roadway graph is unavailable or unreliable.

1. Introduction

The rapid pace of global urbanization has led to unprecedented challenges in urban mobility, with traffic congestion emerging as a critical bottleneck affecting economic productivity, environmental quality, and public safety [1,2]. Intelligent Transportation Systems (ITS) have become a cornerstone of modern smart city initiatives, leveraging advanced information and communication technologies to monitor, analyze, and manage complex traffic networks [3,4]. At the heart of ITS lies the task of accurate and reliable traffic flow prediction, which provides the foundational data for a myriad of applications, including dynamic traffic signal control [5], proactive congestion mitigation [6], and intelligent route guidance [7]. By anticipating future traffic states, transportation authorities can optimize network efficiency, reduce travel times, and enhance overall road safety, making traffic flow prediction a research area of paramount importance [8].

Despite these advancements, achieving high-precision traffic flow prediction is an inherently challenging task due to the complex and dynamic nature of traffic data. Several core technical challenges must be addressed. First, traffic flow exhibits intricate spatio-temporal dependencies [9]. The traffic state at a specific location is not only determined by its own historical patterns (temporal dependency) but is also strongly influenced by the conditions of upstream, downstream, and even functionally similar but geographically distant road segments (spatial dependency) [10]. Second, traffic systems are characterized by high non-linearity and dynamics, frequently affected by non-recurrent events such as traffic accidents, extreme weather, and public holidays, which can cause abrupt changes in flow patterns [11]. Third, traffic data is embedded with multiple periodic patterns, such as daily rush hours and weekly variations between weekdays and weekends, which are crucial for long-term forecasting [12].

To address these challenges, methods have evolved from statistical models to deep learning. Early approaches, including statistical models like ARIMA and machine learning methods like Support Vector Regression (SVR), often struggle to capture the complex non-linear relationships inherent in traffic data [2]. In recent years, deep learning models have become the dominant paradigm. Methods based on Recurrent Neural Networks (RNNs), such as Long Short-Term Memory (LSTM), excel at modeling temporal sequences but often fall short in effectively capturing long-range dependencies and complex spatial correlations [13,14]. Convolutional Neural Networks (CNNs) capture local spatial features but are limited by fixed receptive fields, hindering their adaptability to non-Euclidean road networks [15]. Graph Convolutional Networks (GCNs) have shown great promise by explicitly modeling the topological structure of road networks [16]. However, their performance heavily relies on a pre-defined, often static, adjacency matrix, which may not capture the dynamic and time-varying nature of spatial dependencies in real-world traffic [17,18].

More recently, Transformer-based models have demonstrated remarkable success in capturing long-range dependencies in sequential data [19]. However, standard Transformers, when directly applied to traffic forecasting, may overlook the local context and sequential nature of time series data. Furthermore, their quadratic computational complexity with respect to sequence length remains a significant challenge for long-term prediction tasks [20]. This has spurred research into more efficient and specialized Transformer variants, yet a comprehensive solution that effectively integrates multi-faceted traffic characteristics without relying on a predefined graph structure remains an open challenge.

To address the aforementioned limitations, we propose a novel framework, the Feature-Enhanced iTransformer (FE-iTransformer). Our approach is founded on a two-stage “extract-and-enhance” philosophy. First, a graph-free Feature Enhancement Module (FEM) distills a global context vector from spatio-temporal, periodic, and temporal data. This vector then enriches the input sequence via a per-step mechanism. Finally, an iTransformer backbone processes the enhanced sequence for forecasting. This design effectively decouples complex feature extraction from long-range dependency modeling.

The main contributions of this paper are summarized as follows:

  1. We propose a novel two-stage framework, named Feature-Enhanced iTransformer (FE-iTransformer), for high-accuracy traffic flow forecasting, which decouples the complex task into distinct stages of feature enhancement and enhanced prediction.
  2. We design a comprehensive Feature Enhancement Module (FEM) that effectively captures multi-faceted traffic characteristics by concurrently modeling spatio-temporal dynamics, periodic patterns, and temporal context without relying on a predefined graph structure.
  3. We introduce an innovative per-step feature enhancement mechanism that leverages the global context vector distilled by the FEM to enrich the original input sequence, significantly improving the model’s performance, especially in long-horizon forecasting scenarios.
  4. Extensive experiments demonstrate the effectiveness and robustness of our approach. In-depth ablation studies on our primary datasets (Freeway and Urban) rigorously validate the efficacy of each component of our design, while supplementary experiments on the large-scale PEMS08 benchmark confirm the framework’s superior performance and scalability in a challenging, real-world forecasting environment.

This work targets no-graph or weak-graph prior settings that frequently arise in practice. We therefore study a graph-free two-stage pipeline in which multi-source temporal context is extracted and injected before variable-wise attention. We do not claim superiority over graph-prior models under graph-rich conditions; rather, we provide a deployment-ready alternative when constructing or maintaining a reliable adjacency is costly or impractical. In line with this scope, our empirical comparisons focus on non-graph baselines (linear, recurrent, and transformer families). Effectiveness is demonstrated by vertical comparisons against the iTransformer backbone and horizontal comparisons against representative non-graph baselines, with long-horizon settings (up to 120-min) emphasized. Supplementary results on PEMS08 corroborate scalability.

The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 details the methodology of our proposed model. Section 4 presents the experimental results and analysis. Finally, Section 5 concludes the paper.

2. Related work

The methodologies for traffic flow prediction have evolved substantially, moving from classical statistical models to complex deep learning frameworks that can better navigate the intricate dynamics of modern transportation systems [2,21]. Early approaches were dominated by statistical methods like Auto-Regressive Integrated Moving Average (ARIMA), which models time series based on their historical patterns [22], and machine learning techniques such as Support Vector Regression (SVR), which leverages kernel methods to handle non-linearities [23]. While foundational, these models often struggle to capture the highly complex and non-linear spatio-temporal dependencies inherent in traffic data without extensive manual feature engineering [24].

The rise of deep learning introduced a new era for traffic forecasting. Recurrent Neural Networks (RNNs) and their advanced variants, notably the Long Short-Term Memory (LSTM) network introduced by Hochreiter & Schmidhuber [25], became the primary tools for modeling temporal sequences. Their ability to capture long-term dependencies made them a popular choice in various hybrid frameworks, such as the work by Wang et al. [14], which combined LSTM with Bayesian Neural Networks for uncertainty quantification. However, a core limitation of RNNs is their inherent inability to directly model spatial relationships. To address this, researchers began incorporating Convolutional Neural Networks (CNNs). For instance, Zhang et al. [26] demonstrated a CNN-based approach that treated spatio-temporal data as a 2D matrix. While effective for extracting local spatial patterns, the fixed receptive fields of CNNs limit their capacity to model the non-Euclidean topology of road networks and dynamic long-range spatial correlations, a challenge that recent hybrid models like CCNN-former from Liu et al. [27] and the convolutional transformer by Sattarzadeh et al. [28] aim to resolve by combining CNNs with Transformer architectures. Beyond these established architectures, recent advancements have introduced novel paradigms to address data scarcity and modeling complexity. For instance, generative adversarial networks [29] and diffusion models [30], along with virtual-real fusion simulations [31], have been effectively employed to address data scarcity and synthesize high-fidelity representations. Furthermore, emerging architectures like Kolmogorov-Arnold Networks [32] have demonstrated superior capability in capturing intricate non-linear dynamics compared to traditional MLPs.

To explicitly model the graph structure of road networks, Graph Neural Networks (GNNs) emerged as a milestone technology. The seminal work of Yu et al. [33] on Spatio-Temporal Graph Convolutional Networks (STGCN) provided a powerful framework for simultaneously capturing spatial and temporal features. Building on this, the field has rapidly advanced. Chen et al. [16] incorporated attention mechanisms into dynamic GCNs to improve traffic speed prediction, while Mu et al. [9] developed a hierarchical GNN to capture multi-scale spatio-temporal semantics. Similarly, Ji et al. [10] proposed a hybrid framework combining depthwise separable convolutions with GNNs for efficient multi-scale feature extraction. Despite their success, a fundamental challenge for most GNN-based models remains their reliance on a pre-defined adjacency matrix. As pointed out by Diao et al. [17] and Zheng et al. [18], this static graph often fails to represent the dynamic nature of traffic correlations, leading to research on adaptive and dynamic graph generation techniques. Concurrently, some non-graph spatiotemporal optimization approaches have also emerged to address these limitations. For instance, Chen et al. [34] utilized simulation-based optimization for sensor placement, and Hu et al. [35] developed multi-branch spatiotemporal attention networks for predictive maintenance. These studies demonstrate the potential of capturing complex dependencies and optimizing structural configurations without relying on pre-defined static graphs.

Specifically for sequence modeling tasks such as traffic flow forecasting, the Transformer architecture [36] has demonstrated exceptional capability with its self-attention mechanism, offering an unparalleled ability to capture long-range dependencies. This has led to a surge of Transformer-based models in traffic forecasting. To address the computational complexity of the original Transformer for long sequences, Wu et al. [37] proposed Autoformer, which uses an Auto-Correlation mechanism. More recently, Nie et al. [38] developed PatchTST, which improves efficiency and performance by segmenting time series into patches. A significant innovation came from Liu et al. [39] with iTransformer, which inverts the attention mechanism to operate on the variable (sensor) dimension, a design particularly effective for multivariate time series and which serves as the backbone for our model. While these models achieve state-of-the-art results, they typically function as end-to-end systems, coupling feature extraction and prediction.

Through this review of existing literature, a clear research gap emerges: the lack of a framework that decouples multi-faceted feature extraction from long-range dependency modeling without relying on predefined graph structures. To fill this gap, we propose the Feature-Enhanced iTransformer (FE-iTransformer), a novel two-stage architecture designed to first extract a rich, global context representation and then use it to enhance the input sequence for a more robust and accurate prediction.

3. Methodology

This section elaborates on the technical architecture of our proposed model, the Feature-Enhanced iTransformer (FE-iTransformer). The detailed structure of each component will be presented in the following subsections.

3.1. Problem definition

The task of traffic flow forecasting aims to predict a sequence of future traffic states for a set of spatially distributed sensors, given their historical observations. The historical traffic flow data is represented as a tensor X ∈ ℝ^{B×L×N}, where B is the batch size, L is the length of the historical input sequence (e.g., 32 time steps), and N is the number of sensors. In this work, we focus on a univariate forecasting setting where each sensor provides a single feature (traffic flow).

To enhance long-horizon prediction accuracy, our model also utilizes two types of auxiliary information:

  1. Periodic Data (P): A tensor P ∈ ℝ^{B×2×L×N}, which captures historical daily and weekly patterns. Here, 2 represents the two periodic sequences (daily and weekly).
  2. Time Features (T): A tensor T ∈ ℝ^{B×L×C_t}, which encodes contextual information from timestamps (e.g., hour of the day, day of the week). Here, C_t denotes the number of timestamp-derived features.

Given the historical observations X and the auxiliary inputs (P, T), the objective is to predict the traffic flow for all N sensors over the next τ time steps. The target sequence is denoted as Y ∈ ℝ^{B×τ×N}. The entire forecasting task can thus be formulated as learning a mapping function f_Θ, parameterized by Θ, which maps the historical information to the future traffic sequence:

Ŷ = f_Θ(X, P, T)  (1)

The goal is to optimize the parameters Θ by minimizing the discrepancy between the predicted sequence Ŷ and the ground truth sequence Y.
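As a concrete illustration, the input and output shapes can be sketched as follows. This is a minimal sketch: the batch size, the position of the periodic axis, and the number of timestamp features are assumptions (the 7-sensor datasets and 32-step window used later fix only the sensor and length dimensions).

```python
import numpy as np

# Illustrative shapes: B = batch, L = input steps, N = sensors, TAU = horizon.
# B = 8, the periodic-axis position, and the 2 timestamp features are assumed.
B, L, N, TAU = 8, 32, 7, 4
X = np.zeros((B, L, N))      # historical traffic flow
P = np.zeros((B, 2, L, N))   # periodic data: daily and weekly sequences
T = np.zeros((B, L, 2))      # time features, e.g., hour of day, day of week
Y = np.zeros((B, TAU, N))    # forecasting target
```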

3.2. Overall architecture

To tackle the challenges of long-horizon traffic forecasting, we propose the Feature-Enhanced iTransformer (FE-iTransformer), a novel framework designed with a clear, two-module architecture founded on an “extract-and-enhance” philosophy. The overall architecture is illustrated in Fig 1.

The model is composed of two primary modules: the Feature Enhancement Module (FEM) and the Enhanced Prediction Module. The workflow begins with the raw traffic data X, periodic data P, and time features T being fed into the FEM. This module processes the inputs through three parallel branches—a Spatio-temporal Branch, a Periodicity Branch, and a Temporal Context Branch—to distill a single, comprehensive Global Enhancement Feature Vector, denoted as E. Subsequently, this vector is passed to the Enhanced Prediction Module, where it is used to enrich the original input sequence at each time step. Finally, this information-rich sequence is processed by a powerful iTransformer predictor to generate the final forecast. The detailed design of each component will be presented in the following sections.

3.3. Feature enhancement module (FEM)

The Feature Enhancement Module (FEM) serves as the feature extraction engine of the first stage. Its objective is to distill a compact, high-dimensional global context vector from the heterogeneous input data, including the historical traffic flow X, periodic sequences P, and temporal timestamps T. To achieve this, the FEM employs a multi-branch architecture consisting of three parallel components.

3.3.1. Spatio-temporal branch.

To capture both local spatial correlations and global temporal dynamics from the raw traffic sequence X, we design a hierarchical “Summarize-Broadcast-Reprocess” mechanism. This branch processes the input through four sequential steps:

Step 1: Local Feature Extraction. We first employ a 1D Convolutional Neural Network (CNN) to extract local spatial features. Treating the sensor dimension as channels, the convolution operation transforms the raw input X into a latent feature sequence H ∈ ℝ^{B×L×d}:

H = σ(Conv1D(X))  (2)

where σ(·) denotes the activation function and d is the hidden dimension within the branch.

Step 2: Initial Sequential Encoding & Summarization. The feature sequence H is fed into a primary Long Short-Term Memory (LSTM) network to capture temporal dependencies, yielding a hidden state sequence H^(1) = (h_1, …, h_L). To distill the global context, we apply an attention-based pooling layer:

e_t = v^T tanh(W h_t)  (3)
α_t = exp(e_t) / Σ_{k=1..L} exp(e_k)  (4)
c = Σ_{t=1..L} α_t h_t  (5)

where c serves as a summarized context vector representing the holistic state of the current observation window, and W and v are learnable parameters.

Step 3: Broadcast and Reprocess. A critical innovation of our design is the Broadcast-Reprocess mechanism. The global context c is broadcast (repeated) across the temporal dimension to form a sequence of length L. This sequence is concatenated with the initial hidden states H^(1) and fed into a secondary LSTM. This allows the model to re-examine the local temporal features under the guidance of the global context:

H^(2) = LSTM_2([H^(1) ; Broadcast(c)])  (6)

Step 4: Final Aggregation. Finally, the refined sequence H^(2) is aggregated via another attention pooling layer to produce the final spatio-temporal feature vector F_st:

F_st = AttnPool(H^(2))  (7)
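The four steps above can be sketched in PyTorch. This is a minimal illustration rather than the authors' implementation: the hidden width, kernel size, and ReLU activation are assumptions, and AttnPool is a hypothetical helper realizing the attention pooling of Eqs. (3)-(5).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnPool(nn.Module):
    """Attention pooling: score each step, return the weighted sum."""
    def __init__(self, d):
        super().__init__()
        self.W = nn.Linear(d, d)
        self.v = nn.Linear(d, 1, bias=False)

    def forward(self, h):                       # h: (B, L, d)
        scores = self.v(torch.tanh(self.W(h)))  # (B, L, 1)
        alpha = torch.softmax(scores, dim=1)
        return (alpha * h).sum(dim=1)           # (B, d)

class SpatioTemporalBranch(nn.Module):
    """Summarize-Broadcast-Reprocess sketch (widths are illustrative)."""
    def __init__(self, n_sensors, d=64):
        super().__init__()
        self.conv = nn.Conv1d(n_sensors, d, kernel_size=3, padding=1)  # Step 1
        self.lstm1 = nn.LSTM(d, d, batch_first=True)                   # Step 2
        self.pool1 = AttnPool(d)
        self.lstm2 = nn.LSTM(2 * d, d, batch_first=True)               # Step 3
        self.pool2 = AttnPool(d)                                       # Step 4

    def forward(self, x):                                      # x: (B, L, N)
        h = F.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)  # (B, L, d)
        h1, _ = self.lstm1(h)
        c = self.pool1(h1)                                     # summarize
        c_seq = c.unsqueeze(1).expand(-1, h1.size(1), -1)      # broadcast
        h2, _ = self.lstm2(torch.cat([h1, c_seq], dim=-1))     # reprocess
        return self.pool2(h2)                                  # F_st: (B, d)

f_st = SpatioTemporalBranch(7)(torch.randn(4, 32, 7))  # 7 sensors, 32 steps
```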

3.3.2. Periodicity branch.

Traffic flow exhibits strong cyclical characteristics, most notably daily and weekly periodicities. To capture these recurring patterns while strictly adhering to temporal causality, this branch processes the periodic input tensor P. To prevent data leakage, we strictly implement a Historical Value Retrieval strategy: P is constructed by retrieving specific past observations relative to the prediction start time t_0. The daily sequence P_d is retrieved from time t_0 − T_day (i.e., exactly 24 hours ago), and the weekly sequence P_w from time t_0 − T_week (i.e., exactly 7 days ago), where T_day and T_week denote the number of time steps in one day and one week, respectively.

Shared-Weight Processing. Assuming that traffic evolution patterns are consistent across different time scales, we employ a weight-sharing strategy. Both P_d and P_w are processed independently by the same Bidirectional Gated Recurrent Unit (Bi-GRU) network with shared parameters θ_g:

H_d = Bi-GRU(P_d; θ_g)  (8)
H_w = Bi-GRU(P_w; θ_g)  (9)

The resulting hidden sequences are compressed into context vectors c_d and c_w via attention pooling. These are then concatenated and projected to form the final periodic feature vector F_p:

F_p = W_p [c_d ; c_w] + b_p  (10)
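A minimal sketch of the shared-weight processing, assuming an illustrative hidden width of 64: both windows pass through the same nn.GRU instance, which is exactly what parameter sharing means here.

```python
import torch
import torch.nn as nn

class PeriodicityBranch(nn.Module):
    """Shared-weight Bi-GRU sketch (Eqs. 8-10): daily and weekly windows are
    encoded by the SAME Bi-GRU, pooled with a learned attention score,
    concatenated, and projected. Dimensions are illustrative."""
    def __init__(self, n_sensors, d=64):
        super().__init__()
        self.gru = nn.GRU(n_sensors, d, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * d, 1)   # attention scores for pooling
        self.proj = nn.Linear(4 * d, d)    # [c_d ; c_w] -> F_p

    def pool(self, h):                     # h: (B, L, 2d)
        alpha = torch.softmax(self.score(h), dim=1)
        return (alpha * h).sum(dim=1)      # (B, 2d)

    def forward(self, p_day, p_week):      # each (B, L, N)
        h_d, _ = self.gru(p_day)           # shared parameters for both calls
        h_w, _ = self.gru(p_week)
        return self.proj(torch.cat([self.pool(h_d), self.pool(h_w)], dim=-1))

f_p = PeriodicityBranch(7)(torch.randn(4, 32, 7), torch.randn(4, 32, 7))
```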

3.3.3. Temporal context branch.

This branch encodes explicit timestamp information from the tensor T (e.g., hour of day). Since the immediate future is most strongly correlated with the current time context, we isolate the feature vector t_L corresponding to the final time step of the input window. This vector is mapped to the latent space via a Multi-Layer Perceptron (MLP):

F_t = MLP(t_L)  (11)

3.3.4. Learnable weighted fusion.

To adaptively integrate the features from the three branches, we employ a learnable weighted fusion mechanism. Let θ ∈ ℝ³ be a learnable parameter vector. The fusion weights (w_1, w_2, w_3) are obtained via a Softmax function:

(w_1, w_2, w_3) = Softmax(θ)  (12)

The final global enhancement vector E is computed as the weighted sum:

E = w_1 F_st + w_2 F_p + w_3 F_t  (13)

This vector encapsulates the comprehensive multi-source information and is subsequently used to enrich the input sequence in the Enhanced Prediction Module.
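The fusion itself is compact enough to sketch directly (the feature width in the example is illustrative):

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Softmax-weighted fusion of the three branch vectors (Eqs. 12-13)."""
    def __init__(self):
        super().__init__()
        self.theta = nn.Parameter(torch.zeros(3))     # one logit per branch

    def forward(self, f_st, f_p, f_t):                # each (B, d)
        w = torch.softmax(self.theta, dim=0)          # fusion weights, Eq. (12)
        return w[0] * f_st + w[1] * f_p + w[2] * f_t  # weighted sum, Eq. (13)

fuse = WeightedFusion()
e_global = fuse(torch.randn(4, 64), torch.randn(4, 64), torch.randn(4, 64))
```

With zero-initialized logits the three branches start equally weighted, and the balance is learned end-to-end with the rest of the model.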

3.4. Enhanced prediction module

The Enhanced Prediction Module constitutes the second stage of the FE-iTransformer framework. Its primary function is to leverage the rich, contextual information encapsulated in the Global Enhancement Feature Vector (E) to refine the original input sequence before making the final forecast. This module consists of two main components: a per-step feature enhancement mechanism and the iTransformer predictor.

3.4.1. Per-step feature enhancement.

This mechanism represents a core innovation of our architecture, designed to inject the distilled global context into each individual time step of the input sequence. The process unfolds as follows: First, the Global Enhancement Feature Vector E is projected by a linear layer to align its dimensionality with the sensor space, resulting in a projected vector E′. Then, for each time step t in the historical input sequence X, the corresponding feature vector x_t is concatenated with the entire projected global vector E′. This combined vector is then passed through a dedicated Fusion Network, which is a multi-layer perceptron (MLP), to produce the enhanced feature vector for that time step, x̃_t:

x̃_t = FusionMLP([x_t ; E′])  (14)

By repeating this process for all time steps, we construct a new, information-rich sequence, the Enhanced Sequence X̃ = (x̃_1, …, x̃_L). Each element in this new sequence is now explicitly aware of the global spatio-temporal context summarized by the FEM.

The adoption of a concatenation-based MLP for feature fusion, as opposed to elementary operations like element-wise addition, is a deliberate design choice to enhance representational capacity. While linear summation assumes a simple superposition of context and input, traffic dynamics often involve complex, non-linear dependencies between global trends and local states. The MLP architecture facilitates high-order feature interactions, allowing the model to adaptively recalibrate the importance of the global context vector relative to the local input at each time step. This capability ensures that the injected context serves as a dynamic reference rather than a static bias, thereby maximizing the information gain for the subsequent prediction backbone.
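A sketch of the per-step enhancement under assumed dimensions; the projection layer, the broadcast over time steps, and the concatenation-MLP fusion follow the description above, while the hidden width is an assumption:

```python
import torch
import torch.nn as nn

class PerStepEnhancer(nn.Module):
    """Per-step feature enhancement sketch (Eq. 14): the global vector E is
    projected to sensor space, repeated across every time step, and fused
    with the local input by an MLP. Dimensions are illustrative."""
    def __init__(self, n_sensors, d_global, d_hidden=64):
        super().__init__()
        self.proj = nn.Linear(d_global, n_sensors)   # E -> E' in sensor space
        self.fuse = nn.Sequential(                   # concatenation-based MLP
            nn.Linear(2 * n_sensors, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, n_sensors))

    def forward(self, x, e):            # x: (B, L, N), e: (B, d_global)
        e_proj = self.proj(e)           # (B, N)
        e_seq = e_proj.unsqueeze(1).expand(-1, x.size(1), -1)  # repeat over L
        return self.fuse(torch.cat([x, e_seq], dim=-1))        # (B, L, N)

x_enh = PerStepEnhancer(7, 64)(torch.randn(4, 32, 7), torch.randn(4, 64))
```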

3.4.2. iTransformer predictor.

The final prediction is generated by a powerful iTransformer backbone [39], which is highly effective for long-sequence forecasting. The enhanced sequence X̃ serves as the primary input to the iTransformer backbone.

The input to the iTransformer encoder is formed by combining a value embedding of X̃ with a temporal embedding derived from the original time features T. The core of the predictor is a multi-layer iTransformer encoder. As proposed in the original work, its self-attention mechanism critically operates across the variable (sensor) dimension rather than the temporal dimension, allowing it to effectively capture dependencies among the sensors across the entire sequence. The output from the encoder is then passed to a final linear projection layer, which acts as the prediction head, to produce the final forecast Ŷ.
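The key inversion can be illustrated with standard PyTorch encoder layers. This is not the iTransformer implementation (depth, width, and head count are placeholders, and the temporal embedding is omitted), but it shows the essential point: each sensor's entire series is embedded as one token, so attention runs over the N variables rather than over time.

```python
import torch
import torch.nn as nn

class InvertedEncoderSketch(nn.Module):
    """Minimal 'inverted' attention sketch: each sensor's whole input series
    becomes one token, so self-attention mixes information across the N
    variables; a linear head maps each token to its forecast."""
    def __init__(self, seq_len, pred_len, d_model=64, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(seq_len, d_model)       # series -> token
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, pred_len)       # token -> forecast

    def forward(self, x):                        # x: (B, L, N)
        tokens = self.embed(x.transpose(1, 2))   # (B, N, d_model)
        z = self.encoder(tokens)                 # attention over N variables
        return self.head(z).transpose(1, 2)      # (B, pred_len, N)

y_hat = InvertedEncoderSketch(32, 4)(torch.randn(2, 32, 7))
```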

3.5. Training objective

The parameters of the FE-iTransformer model are optimized by minimizing a hybrid loss function that combines the strengths of two widely used regression metrics: the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE). This composite loss leverages the robustness of MAE to potential outliers in traffic data, while also benefiting from the sensitivity of RMSE to large prediction errors.

The final loss function, L_total, is a weighted sum of these two components, controlled by a hyperparameter λ. In our experiments, missing values in the ground truth are masked and ignored during the calculation of both metrics. The overall objective is to minimize the following function:

L_total = λ · MAE + (1 − λ) · RMSE  (15)

where λ is set to 0.7 to place a slightly greater emphasis on the MAE component, thereby promoting a more stable training process.
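A sketch of the masked hybrid objective; the mask encoding (1 = observed, 0 = missing) is an assumption of this sketch:

```python
import torch

def hybrid_loss(pred, target, mask, lam=0.7):
    """Masked hybrid loss sketch (Eq. 15): lam * MAE + (1 - lam) * RMSE,
    computed only over the observed entries (mask == 1)."""
    diff = (pred - target)[mask.bool()]
    mae = diff.abs().mean()
    rmse = torch.sqrt((diff ** 2).mean())
    return lam * mae + (1 - lam) * rmse
```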

4. Experiments

To systematically evaluate the performance of the proposed FE-iTransformer model, we conduct a series of comprehensive experiments. This chapter is organized to present a multi-faceted validation of our framework. First, we detail the experimental settings, including the datasets, evaluation metrics, and baseline models. Second, we present our main results and in-depth analyses on our primary datasets (Freeway and Urban), where we establish the model’s state-of-the-art performance and conduct rigorous ablation studies to validate our core design principles. Third, we address the crucial question of scalability by presenting supplementary experiments on the large-scale PEMS08 benchmark. Finally, we provide qualitative analysis through visualizations to offer more intuitive insights into the model’s predictive behavior.

4.1. Experimental settings

This section details the comprehensive experimental setup designed to rigorously evaluate our proposed FE-iTransformer framework. We describe the datasets used for validation, the metrics for performance evaluation, the state-of-the-art baseline models for comparison, and the specific implementation details to ensure the reproducibility of our results.

4.1.1. Datasets.

Our empirical evaluation is conducted on three real-world benchmark datasets, carefully selected to validate our proposed framework from different perspectives, and all datasets are sourced from the Performance Measurement System (PeMS). Our primary experiments are conducted on two classic datasets, Freeway and Urban, whose moderate scale and distinct traffic characteristics provide an ideal environment for detailed model analysis and rigorous ablation studies. To further validate the scalability and generalization of our findings in a more complex setting, we also include a supplementary evaluation on the large-scale PEMS08 benchmark.

Freeway Dataset: This dataset contains traffic flow data collected from 7 sensors deployed on the SR99-S freeway in District 10, California. It is characterized by high-speed and relatively structured traffic patterns, serving as a standard benchmark for highway traffic prediction. In our work, we utilize data spanning from September 18th, 2017, to March 4th, 2018.

Urban Dataset: This dataset comprises traffic flow data from 7 sensors located on urban streets in District 4, Oakland, California. It represents a more complex traffic environment with frequent stops, intersections, and diverse traffic patterns, posing a greater challenge for forecasting models. The time range of the data used is consistent with the Freeway dataset.

PEMS08 Dataset: This large-scale dataset was collected from 170 sensors on the highways of the San Bernardino area over 62 days, from July 1st to August 31st, 2016. Its larger and more complex road network topology makes it a challenging benchmark for testing a model’s ability to handle intricate, large-scale spatio-temporal dependencies, and thus serves as a robust testbed for the scalability of our framework.

The datasets are split strictly chronologically into training (64%), validation (16%), and testing (20%) sets to maintain temporal causality. The exact date boundaries are provided in Table 1. Furthermore, regarding the construction of periodic features at the boundaries (e.g., the beginning of the test set), we retrieve the necessary historical data from the preceding validation or training sets. For the initial samples of the entire dataset where historical data is unavailable (i.e., where the index of the observation exactly 24 hours or 7 days earlier would be negative), we apply zero-padding to ensure that no future data or invalid negative indices are accessed. This strategy guarantees a rigorous leak-free evaluation.
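The zero-padded retrieval can be sketched with a small hypothetical helper that only ever reads indices strictly before the prediction start, padding anything earlier than the dataset origin with zeros:

```python
import numpy as np

def periodic_window(series, t0, length, offset):
    """Return the window of `length` steps starting `offset` steps before t0.
    Positions whose index would be negative are zero-padded, and indices at
    or beyond t0 are never read, so no future data can leak."""
    out = np.zeros((length,) + series.shape[1:], dtype=series.dtype)
    for i in range(length):
        idx = t0 - offset + i
        if 0 <= idx < t0:      # strictly in the past of the prediction start
            out[i] = series[idx]
    return out
```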

Table 1. Statistics and chronological partitioning boundaries of the experimental datasets.

https://doi.org/10.1371/journal.pone.0340389.t001

4.1.2. Evaluation metrics.

The performance of our proposed model and the baseline methods is quantitatively assessed from different perspectives using three widely adopted evaluation metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE), all of which measure the magnitude of prediction errors.

Mean Absolute Error (MAE): This metric measures the average magnitude of the errors in a set of predictions, without considering their direction. It is generally robust to outliers. The formula is defined as:

MAE = (1/|Ω|) Σ_{i∈Ω} |y_i − ŷ_i|  (16)

where Ω is the set of observed (non-missing) data points, ŷ_i is the predicted value, and y_i is the ground truth value.

Root Mean Squared Error (RMSE): This metric is the square root of the average of squared differences between prediction and actual observation. It is more sensitive to large errors than MAE, penalizing them more heavily. The formula is:

RMSE = √[ (1/|Ω|) Σ_{i∈Ω} (y_i − ŷ_i)² ]  (17)

Mean Absolute Percentage Error (MAPE): This metric expresses the prediction error as a percentage of the actual value, providing a relative and easily interpretable measure of accuracy. The formula is:

MAPE = (100%/|Ω|) Σ_{i∈Ω} |(y_i − ŷ_i)/y_i|  (18)
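Under these definitions, the three metrics can be computed over the observed entries as follows (the boolean mask encoding is an assumption; MAPE additionally assumes non-zero ground truth on the masked set):

```python
import numpy as np

def masked_metrics(pred, true, mask):
    """MAE, RMSE, MAPE (Eqs. 16-18) computed over observed entries only."""
    p, t = pred[mask], true[mask]
    err = t - p
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    mape = 100.0 * np.abs(err / t).mean()   # assumes t != 0 on the masked set
    return mae, rmse, mape
```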

4.1.3. Baseline methods.

To comprehensively evaluate the performance of our proposed FE-iTransformer, we compare it against a diverse and representative set of baseline methods. These baselines range from classical machine learning models to state-of-the-art deep learning architectures, providing a rigorous benchmark for our approach. The selected methods are as follows:

SVR (Support Vector Regression): A classical machine learning model that utilizes kernel methods to effectively capture non-linear relationships in the data.

LSTM (Long Short-Term Memory): A seminal recurrent neural network (RNN) architecture specifically designed to capture long-term temporal dependencies in sequential data.

DLinear: A simple yet surprisingly effective model for long-term forecasting, which is based on a single linear layer and has been shown to outperform many complex models.

Transformer: The original Transformer architecture adapted for time series forecasting, serving as a fundamental benchmark for attention-based models.

Autoformer: An efficient Transformer variant designed for long-sequence forecasting, which introduces a decomposition architecture and an Auto-Correlation mechanism.

PatchTST: A state-of-the-art Transformer-based model that segments time series into patches, enabling the model to capture local semantic information more effectively.

iTransformer: A recent state-of-the-art model that inverts the role of time steps and variables in the Transformer’s attention mechanism. As our FE-iTransformer utilizes iTransformer as its predictor backbone, this serves as the most crucial baseline to directly quantify the performance gain brought by our proposed Feature Enhancement Module (FEM).

4.1.4. Implementation details.

For a fair comparison, all models were implemented in PyTorch and trained using the Adam optimizer with an initial learning rate of 0.0005 and a weight decay of 1e-4. We employed a learning rate scheduler that reduces the learning rate on a plateau of the validation loss. The models were trained for a maximum of 100 epochs with a batch size of 32. An early stopping strategy was applied with a patience of 15 epochs to prevent overfitting and select the best model based on its performance on the validation set.

The historical input sequence length () was set to 32 time steps. The prediction horizon () was set to 4 steps, corresponding to forecasts at future intervals of 30, 60, 90, and 120 minutes. For our proposed FE-iTransformer and all other Transformer-based baselines, the model dimension () was set to 512, the number of attention heads () was set to 8, and the dropout rate was 0.1. Specifically for our FE-iTransformer, the number of encoder layers () was set to 3.
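The training protocol described above (best-model selection on validation loss, early stopping with a patience of 15 epochs) can be sketched with a minimal helper; the class name and interface below are illustrative and not taken from the released code. In PyTorch, the optimizer and scheduler would correspond to `torch.optim.Adam` with `lr=5e-4, weight_decay=1e-4` and `torch.optim.lr_scheduler.ReduceLROnPlateau` driven by the validation loss.

```python
class EarlyStopping:
    """Early stopping on validation loss: training halts after `patience`
    consecutive epochs without improvement, and the epoch with the lowest
    validation loss is the one whose weights are retained."""

    def __init__(self, patience=15, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Call once per epoch; returns True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss          # improvement: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1          # no improvement this epoch
        return self.bad_epochs >= self.patience
```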

4.2. Main results and analysis on primary datasets

This section details the core experimental results on our primary datasets, Freeway and Urban. The analysis is structured to first establish the overall superiority of the FE-iTransformer against a comprehensive set of baselines, and then to conduct in-depth ablation studies to attribute this performance gain to our proposed architectural innovations.

4.2.1. Comprehensive performance comparison.

The comprehensive quantitative results for all models on the Freeway and Urban datasets are presented in Table 2. The performance is evaluated at four future prediction horizons: 30, 60, 90, and 120 minutes. For each horizon, we report the MAE, RMSE, MAPE, and R² metrics. The best-performing model for each metric is highlighted in bold.

Table 2. Comprehensive performance comparison on freeway and urban datasets.

https://doi.org/10.1371/journal.pone.0340389.t002

The comparison with contemporary Transformer-based architectures reveals a more nuanced but equally important form of superiority. While models like Transformer, Autoformer, and PatchTST possess powerful sequence modeling capabilities, our FE-iTransformer consistently demonstrates a clear edge. This suggests that simply applying a powerful backbone is insufficient. The key differentiator lies in our two-stage “extract-and-enhance” paradigm. By first distilling a potent, globally-aware context vector and then using it to enrich the input sequence, our framework provides the predictor with a more comprehensive understanding of the overarching traffic dynamics. This architectural choice proves to be more effective than the tightly-coupled, end-to-end feature extraction and prediction mechanisms common in other Transformer variants, especially for mitigating error accumulation in long-horizon forecasting. The consistent outperformance, therefore, is not merely an incremental improvement but a validation of our core design philosophy.

4.2.2. Ablation studies.

To dissect the sources of FE-iTransformer’s superior performance and validate our core design principles, we conducted a series of ablation studies on the more complex Urban dataset. These studies are designed to first quantify the overall contribution of the Feature Enhancement Module and then to analyze the necessity of its individual components.

  1. Effectiveness of the Feature Enhancement Module (FEM)

To directly measure the impact of our primary innovation, we compare the full FE-iTransformer model against its pure iTransformer backbone. The results of this critical comparison are summarized in Table 3.

Table 3. Ablation Study: Impact of the Feature Enhancement Module (FEM).

https://doi.org/10.1371/journal.pone.0340389.t003

The data unequivocally demonstrates the significant value added by the FEM. On the Urban dataset, the introduction of the FEM reduces the Mean Absolute Error (MAE) by 0.9%, 9.1%, 14.2%, and a substantial 18.8% for the 30, 60, 90, and 120-minute horizons, respectively. This trend, where the performance gain becomes increasingly pronounced with a longer prediction horizon, strongly suggests that the global context provided by the FEM is highly effective at mitigating the error accumulation problem inherent in long-range forecasting. By enriching the input sequence with a globally-aware context vector, the FEM provides the predictor with a more informative and robust representation, leading to a dramatic improvement in forecasting accuracy.

To provide a qualitative and more intuitive illustration of the performance gains quantified in Table 3, Fig 2 presents a focused visual comparison on the Urban dataset. This visualization plots the 120-minute prediction results of our full FE-iTransformer against its pure iTransformer backbone over a continuous 48-hour period, offering insights into their behavior within a more complex and dynamic traffic environment.

Fig 2. Detailed 48-hour performance comparison on the Urban dataset for the 120-minute prediction horizon.

The plotted period (Feb 01–02, 2018) corresponds to workdays (Thursday and Friday).

https://doi.org/10.1371/journal.pone.0340389.g002

The visualization compellingly demonstrates the sustained and robust performance improvement provided by the Feature Enhancement Module. Throughout the two full daily cycles, which are characterized by less regular patterns compared to highway traffic, our model (solid line) tracks the ground truth with significantly higher fidelity. The baseline iTransformer (dashed line), lacking the global context provided by the FEM, exhibits more pronounced deviations and struggles to capture the correct magnitude and timing of the traffic peaks.

This visual evidence strongly corroborates our quantitative findings on the Urban dataset, validating that our feature enhancement paradigm is crucial for empowering the model to achieve reliable and accurate predictions, especially over extended, multi-cycle forecasting horizons in complex urban scenarios.

  2. Necessity of Individual FEM Components

To assess the contribution of each branch in the FEM and validate its multi-branch architecture, we conducted an ablation study in which the three parallel branches were removed one at a time, yielding three model variants with all other components held constant:

w/o Spatio-temporal: Removes the Spatio-temporal Branch.

w/o Periodicity: Removes the Periodicity Branch.

w/o Temporal Context: Removes the Temporal Context Branch.

The performance of these variants is compared against our full model on the Urban dataset, with the results summarized in Table 4.

Table 4. Ablation Study of Individual Components within the FEM.

https://doi.org/10.1371/journal.pone.0340389.t004

The results, summarized in Table 4, reveal the distinct and complementary roles of each branch within the Feature Enhancement Module. A quantitative analysis of the performance degradation upon the removal of each component confirms their respective and consistent contributions to the model’s overall forecasting accuracy on the Urban dataset.

The removal of the Temporal Context Branch (w/o Temporal Context) leads to the most dramatic performance degradation across all horizons. At the 30-minute horizon, the MAE increases dramatically from 16.23 to 55.26, a surge of approximately 240%. This substantial increase in error underscores the critical importance of explicit timestamp information for the model to accurately anchor its predictions in a specific temporal context.

Similarly, removing the Periodicity Branch (w/o Periodicity) also results in a severe decline in performance. The MAE at the 30-minute horizon rises from 16.23 to 26.83, an increase of about 65%. This finding highlights the necessity of modeling historical cyclical patterns (daily and weekly) to establish a robust baseline for forecasting, especially in the more complex and pattern-driven urban environment.

Finally, the exclusion of the Spatio-temporal Branch (w/o Spatio-temporal) leads to a clear and consistent degradation in prediction accuracy across all prediction horizons. For instance, the MAE increases from 16.23 to 18.25 at the 30-minute mark, and from 17.52 to 18.97 at the 120-minute mark. This demonstrates that even in the presence of strong periodic and temporal signals, the localized spatial patterns and immediate temporal dynamics captured by the CNN-LSTM pipeline provide unique and non-redundant information that is essential for achieving the highest level of accuracy.

An interesting and crucial observation from comparing Table 3 and Table 4 is that the performance of our model variants with a single branch removed (e.g., ‘w/o Spatio-temporal’ with MAE 18.25 at 30 min) is not only significantly worse than the full model (MAE 16.23) but is also inferior to the vanilla iTransformer baseline (MAE 16.37). This phenomenon is not a model flaw but rather a testament to the synergistic nature of our Feature Enhancement Module. We attribute this systematic degradation to the “Latent Distribution Distortion” caused by feature decoupling. In the proposed framework, the FEM branches are optimized jointly to capture complementary components of the traffic signal: the Periodicity Branch explicitly models regular recurring patterns (e.g., daily/weekly cycles), while the Spatio-temporal Branch targets complex local dynamics from the raw input. Consequently, the branches become specialized and interdependent. When a single branch is removed, the resulting global context vector becomes a distorted representation covering only a partial information spectrum (e.g., regular patterns without dynamic context, or dynamics without periodic reference).

Crucially, this distorted feature vector is structurally inferior to the raw input. The vanilla iTransformer backbone, while lacking explicit enhancement, processes the raw sequence which naturally preserves the complete signal integrity (containing both regularities and variations). In contrast, the partial FEM injects a structurally incomplete representation that acts as out-of-distribution noise to the predictor. This forces the backbone to map a skewed feature space to the ground truth, resulting in higher systematic errors than simply learning from the raw, undistorted data.

4.3. Validation on a large-scale benchmark

While the ablation studies in the previous section rigorously validated our framework’s internal mechanisms, a potential concern is its scalability and performance on larger, more complex transportation networks. To address this, we conducted a targeted validation experiment on the large-scale PEMS08 public benchmark, which features 170 sensors and significantly more intricate spatio-temporal dynamics. We focus our analysis on the most challenging 120-minute long-horizon forecasting task, as this is where a model’s robustness and ability to mitigate error accumulation are most critically tested.

The performance comparison on the PEMS08 dataset is presented in Table 5. The results unequivocally demonstrate both the successful scalability and the state-of-the-art performance of our proposed framework.

Table 5. Performance Comparison on the PEMS08 Dataset for the 120-min Horizon.

https://doi.org/10.1371/journal.pone.0340389.t005

First, a vertical comparison with its backbone reveals the sustained effectiveness of our core paradigm in a large-scale setting. Even in this complex, high-dimensional environment, FE-iTransformer (MAE 24.02) achieves a remarkable 19.1% reduction in Mean Absolute Error compared to the vanilla iTransformer (MAE 29.68). This provides compelling evidence that the global context vector distilled by the FEM is a robust and scalable mechanism, capable of significantly enhancing the underlying predictor’s capabilities even when faced with intricate, large-scale spatio-temporal dynamics.

Second, a horizontal comparison across different model families highlights the architectural advantages of our approach. Simple linear models like DLinear show surprisingly competitive performance; their simplicity likely mitigates overfitting on long-horizon tasks, although they may lack the capacity to capture complex non-linearities. On the other hand, sophisticated Transformer variants such as Autoformer and PatchTST do not exhibit a clear advantage over the vanilla Transformer or even iTransformer in this specific task, with their MAE scores clustering in the 28–30 range. This suggests that their specialized attention mechanisms or patching strategies, while effective in other domains, may not be optimally suited for capturing the unique, multi-faceted dependencies of traffic flow data at this scale.

In stark contrast, our FE-iTransformer establishes a new, significantly lower error benchmark (MAE 24.02). This substantial performance leap indicates that the key to advancing long-horizon traffic forecasting may not lie solely in refining the self-attention mechanism itself, but rather in how the model is supplied with rich, pre-distilled, domain-specific information. Our “extract-and-enhance” philosophy, by decoupling the complex task of multi-faceted feature extraction from the sequence-to-sequence prediction, proves to be a more effective and robust strategy. It empowers a powerful predictor like iTransformer to focus on its core strength—modeling temporal dependencies—while being continuously informed by a comprehensive global context, ultimately leading to a new level of prediction accuracy.

To provide a more intuitive understanding of our model’s predictive behavior, Fig 3 visualizes the 120-minute prediction results from FE-iTransformer and several key baselines against the ground truth over a 7-day period.

Fig 3. Multi-model prediction performance on the PEMS08 dataset for the 120-minute horizon over a 7-day period.

The gray shaded area indicates the weekend (Aug 20–21), and the white area represents workdays.

https://doi.org/10.1371/journal.pone.0340389.g003

The visualization compellingly illustrates our model’s superior fitting capability. The prediction curve of FE-iTransformer (solid red line) consistently aligns more closely with the ground truth (solid blue line) compared to all other methods. This is particularly evident during traffic peaks and troughs, where many baseline models—including its own backbone iTransformer (dashed blue line) and other competitive models like Autoformer (brown line)—exhibit significant lag, overshooting, or under-prediction.

This qualitative evidence, combined with the quantitative results, strongly suggests that the superior performance stems from the rich, multi-faceted global context provided by the FEM. This allows the FE-iTransformer to maintain a much more accurate and stable trend, demonstrating its robust capability in long-horizon forecasting on complex, large-scale networks. This supplementary experiment, therefore, provides strong evidence that our “extract-and-enhance” paradigm is not only effective in controlled settings but also a robust and scalable solution for real-world scenarios.

4.4. Visualization of prediction results

To provide a qualitative and more intuitive assessment of our model’s performance on the primary datasets, this section visualizes the prediction results of FE-iTransformer against the ground truth and key baseline methods. These visualizations serve as a visual counterpart to the quantitative results presented in Section 4.2, offering insights into the models’ dynamic behavior.

Fig 4 presents a multi-faceted comparison across both the Freeway and Urban datasets at short-term (30-min) and long-term (120-min) prediction horizons. In each subplot, we plot the predictions from our FE-iTransformer, its backbone iTransformer, and other representative baselines against the ground truth curve for a 48-hour period starting from Jan 29, 2018. As can be observed, our model’s predictions (solid red line) consistently align more closely with the ground truth (solid blue line) compared to the other methods. The performance gap is particularly evident in the 120-minute horizon plots (right column), where many baseline models exhibit significant lag or overshooting. In contrast, the FE-iTransformer, benefiting from the global context provided by the FEM, maintains a much more accurate and stable trend, demonstrating its superior capability in long-horizon forecasting. A similar, even more pronounced, superior fitting capability on the large-scale PEMS08 dataset has already been demonstrated and analyzed in Section 4.3.

Fig 4. Visualization of multi-model prediction performance.

The plots compare FE-iTransformer and baselines on the Freeway (top) and Urban (bottom) datasets for 30-minute (left) and 120-minute (right) horizons. The data spans a 48-hour workday period (Feb 01–02, 2018).

https://doi.org/10.1371/journal.pone.0340389.g004

4.5. Computational efficiency and deployment analysis

To evaluate the operational cost of the proposed two-stage architecture, we conducted a quantitative comparison between the FE-iTransformer and its backbone on the PEMS08 dataset. Table 6 presents the efficiency metrics measured with a batch size of 32. The experiments were conducted on a standard NVIDIA GeForce MX350 Laptop GPU.

Table 6. Computational efficiency comparison on PEMS08 (170 Nodes).

https://doi.org/10.1371/journal.pone.0340389.t006

As shown, the Feature Enhancement Module (FEM) introduces a slight increase in parameters (+5.8%) and a moderate increase in theoretical FLOPs (+12.6%), indicating that the module is relatively lightweight. We observe a notable increase in inference latency (+40.4%), rising from 73.03 ms to 102.54 ms per batch. This is primarily attributed to the sequential processing required by the recurrent components in the FEM, which restricts the parallel efficiency typical of Transformer architectures.

While this architectural design leads to increased latency, it represents a strategic trade-off between efficiency and robustness. The Spatio-temporal Branch accounts for a significant portion of this latency, but its complexity is justified by its specific functional role, which differs fundamentally from that of the Periodicity Branch. Although ablation studies (Table 4) suggest that removing this branch causes a smaller rise in MAE than removing periodic features, this metric primarily reflects the dominance of repetitive patterns in traffic data. The Periodicity Branch captures these easy-to-learn base patterns. In contrast, the Spatio-temporal Branch is tasked with the far more challenging objective of capturing high-frequency, non-recurrent dynamics and sudden traffic shifts. Simpler alternatives, such as standard CNN-GRU pipelines, often lack the global context awareness to distinguish these anomalies from noise. Therefore, the additional computational cost is a necessary investment to ensure model robustness in complex, dynamic scenarios.

Deployment Feasibility: Despite the increased latency, we believe the FE-iTransformer remains feasible for practical Intelligent Transportation Systems (ITS):

  1. Time Constraints: The inference time of ~102 ms is well within the typical 5-minute (300,000 ms) data collection interval of traffic systems, leaving sufficient margin for real-time operations.
  2. Resource Usage: The peak memory consumption (~124 MB) remains manageable for standard industrial edge devices.
  3. Cost-Benefit Balance: Considering the 19.1% reduction in MAE for long-horizon forecasting, we consider the additional computational overhead to be an acceptable trade-off for high-stakes traffic management tasks.

4.6. Applicability scope and methodological positioning

It is worth discussing the positioning of FE-iTransformer relative to Graph Neural Networks (GNNs), which represent another dominant paradigm in traffic forecasting. GNNs (e.g., STGCN, DCRNN) achieve high performance by explicitly leveraging pre-defined adjacency matrices to model spatial topology. In scenarios where high-quality, static road graphs are readily available, GNNs remain a powerful choice.

However, the design of FE-iTransformer targets a different, yet equally critical practical constraint: the “No-Graph” or “Weak-Graph” setting. In many real-world deployments—such as ad-hoc sensor networks or privacy-preserved data sharing—topological information is often missing or unreliable. Under these constraints, graph-dependent models may suffer from performance degradation or become inapplicable.

Therefore, our experimental design focuses on comparing FE-iTransformer against graph-agnostic baselines (e.g., iTransformer, PatchTST). This ensures a rigorous evaluation where all models operate under identical information conditions (i.e., time series data only). The results demonstrate that FE-iTransformer effectively compensates for the lack of explicit topological priors by distilling implicit spatio-temporal dynamics via the FEM. This positions the proposed framework not as a direct competitor to GNNs in graph-rich environments, but as a robust, deployment-ready alternative for scenarios where maintaining a reliable graph structure is costly or infeasible.

5. Conclusion

In this paper, we proposed a novel two-stage framework for high-accuracy, long-horizon traffic flow forecasting, named the Feature-Enhanced iTransformer (FE-iTransformer). Motivated by the persistent challenges in capturing complex spatio-temporal dependencies without relying on predefined graph structures, our work introduces an “extract-and-enhance” paradigm. This paradigm first employs a comprehensive Feature Enhancement Module (FEM) to distill a potent global context vector from multi-source information, including spatio-temporal dynamics, periodic patterns, and temporal context. Subsequently, an innovative per-step feature enhancement mechanism leverages this global context to enrich the original input sequence, which is then processed by a powerful iTransformer backbone for final prediction.
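As a rough illustration of the per-step enhancement idea summarized above, the sketch below broadcasts a global context vector to every time step of the input sequence and fuses the two. The concatenation-based fusion, the NumPy shapes, and the function name are assumptions made for illustration only; the exact fusion operator used by the FEM may differ from this sketch.

```python
import numpy as np

def per_step_enhance(x, g):
    """Per-step feature enhancement (illustrative sketch).

    x : (T, d_x) input sequence; g : (d_g,) global context vector distilled
    by a feature-extraction stage. The global vector is repeated at every
    time step and concatenated with the input, yielding an enriched
    (T, d_x + d_g) sequence for the downstream predictor."""
    T = x.shape[0]
    g_rep = np.repeat(g[None, :], T, axis=0)   # broadcast g to all T steps
    return np.concatenate([x, g_rep], axis=1)  # enrich each step with context
```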

The effectiveness of the proposed FE-iTransformer was rigorously validated through a multi-faceted experimental evaluation. First, in-depth experiments on our primary datasets (Freeway and Urban) established the model’s superior performance over a wide range of state-of-the-art baselines and, through detailed ablation studies, provided compelling evidence for the efficacy of our two-stage design. The introduction of the FEM was shown to significantly improve the performance of the pure iTransformer backbone, with the performance gain becoming increasingly pronounced in the most challenging long-horizon forecasting tasks. Second, supplementary experiments on the large-scale PEMS08 benchmark further confirmed the framework’s robustness and scalability, where it again achieved state-of-the-art results, reducing the MAE by 19.1% in the 120-minute horizon. These key findings strongly validate the superiority of decoupling the complex feature extraction process from the final sequence prediction task.

Despite the promising results, this study identifies opportunities for future improvement. First, the current model relies solely on historical traffic observations; future iterations will extend the Feature Enhancement Module (FEM) to incorporate exogenous signals such as weather conditions and traffic incidents. Second, regarding computational efficiency, the two-stage architecture introduces additional overhead compared to single-stage baselines. While our analysis confirms that the inference latency remains well within real-time deployment limits (as discussed in Section 4.5), the sequential processing in the FEM limits parallelization efficiency. Future work will therefore explore lightweight alternatives, such as adaptive gating mechanisms or efficient attention variants, to further optimize the accuracy-efficiency trade-off for strictly resource-constrained edge devices.

References

  1. Zhao Z, Chen J. Application of artificial intelligence technology in the economic development of urban intelligent transportation system. PeerJ Comput Sci. 2025;11:e2728. pmid:40134871
  2. Liu R, Shin SY. A review of traffic flow prediction methods in intelligent transportation system construction. Computer Science and Mathematics. 2025. https://www.preprints.org/manuscript/202503.0643/v1
  3. Lu X, Yan Y, Wei J, Zhang X, Wang C, Li H, et al. Reinforcement Learning Assisted Triboelectric Self-Regulating Intelligent Transportation System. Advanced Functional Materials. 2025;35.
  4. Das D, Banerjee S, Chatterjee P, Ghosh U, Biswas U. Blockchain for Intelligent Transportation Systems: Applications, Challenges, and Opportunities. IEEE Internet Things J. 2023;10(21):18961–70.
  5. Chen W, Liu B, Han W, Li G, Song B. Dynamic path planning based on traffic flow prediction and traffic light status. In: Tari Z, Li K, Wu H, editors. Algorithms and architectures for parallel processing. Singapore: Springer Nature Singapore; 2024:419–38.
  6. Wu F, Tan H, Zhang L, Wen S, Hu T. Multivariate Machine Learning Model Based on YOLOv8 for Traffic Flow Prediction in Intelligent Transportation Systems. IEEE Access. 2025;13:105091–100.
  7. Wang Y, Lv X, Gu X, Li D. A New Edge Computing Method Based on Deep Learning in Intelligent Transportation System. IEEE Trans Veh Technol. 2025;74(2):2210–9.
  8. Zhang M, Cao JX. Dynamic Routing Spatial-Temporal Transformer Model for Traffic Flow Prediction. IEEE Trans Intell Transport Syst. 2025;26(7):10116–30.
  9. Mu H, Aljeri N, Boukerche A. Hierarchical multi-scale spatio-temporal semantic graph convolutional network for traffic flow forecasting. Journal of Network and Computer Applications. 2025;238:104166.
  10. Ji A, Liu Z, Su L, Dai Z. A hybrid framework for spatio-temporal traffic flow prediction with multi-scale feature extraction. Information Sciences. 2025;716:122259.
  11. Song Y, Luo R, Zhou T, Zhou C, Su R. Graph Attention Informer for Long-Term Traffic Flow Prediction under the Impact of Sports Events. Sensors (Basel). 2024;24(15):4796. pmid:39123843
  12. Wang H, Zhang H, Chen L, Chen L. Auto-correlation based spatio-temporal adaptive transformer traffic flow prediction. Proceedings of the Institution of Mechanical Engineers, Part D: Journal of Automobile Engineering. 2025.
  13. Guo J, Wu X. Research on freeway traffic flow prediction method based on Att-Conv-LSTM model. In: Seventh International Conference on Traffic Engineering and Transportation System (ICTETS 2023); 2024:133.
  14. Wang Y, Ke S, An C, Lu Z, Xia J. A Hybrid Framework Combining LSTM NN and BNN for Short-term Traffic Flow Prediction and Uncertainty Quantification. KSCE Journal of Civil Engineering. 2024;28(1):363–74.
  15. Feng R, Chen M, Song Y. Learning traffic as videos: Short-term traffic flow prediction using mixed-pointwise convolution and channel attention mechanism. Expert Systems with Applications. 2024;240:122468.
  16. Chen H, Han H, Chen Y, Chen Z, Gao R, Li X. A traffic speed prediction algorithm for dynamic spatio-temporal graph convolutional networks based on attention mechanism. J Supercomput. 2025;81(1):18.
  17. Diao C, Zhang D, Liang W, Jiang M, Li K. A Novel Attention-Based Dynamic Multi-Graph Spatial-Temporal Graph Neural Network Model for Traffic Prediction. IEEE Trans Emerg Top Comput Intell. 2025;9(2):1910–23.
  18. Yingran Z, Chao L, Rui S. Enhancing Traffic Flow Forecasting With Delay Propagation: Adaptive Graph Convolution Networks for Spatio-Temporal Data. IEEE Trans Intell Transport Syst. 2025;26(1):650–60.
  19. Kong J, Fan X, Zuo M, Deveci M, Jin X, Zhong K. ADCT-Net: Adaptive traffic forecasting neural network via dual-graphic cross-fused transformer. Information Fusion. 2024;103:102122.
  20. Omar M, Yakub F, Wijaya AA, Mohammad AF, Maslan MN, Dhamanti I, et al. Evaluation of Encoder-Only Transformer for Multi-Step Traffic Flow Prediction. IEEE Access. 2025;13:106349–68.
  21. Yin X, Wu G, Wei J, Shen Y, Qi H, Yin B. Deep Learning on Traffic Prediction: Methods, Analysis, and Future Directions. IEEE Trans Intell Transport Syst. 2022;23(6):4927–43.
  22. Time series traffic flow prediction with hyper-parameter optimized ARIMA models for intelligent transportation system. JSIR. 2022;81(04).
  23. Hu J, Gao P, Yao Y, Xie X. Traffic flow forecasting with particle swarm optimization and support vector regression. In: 17th International IEEE Conference on Intelligent Transportation Systems (ITSC). Qingdao, China: IEEE; 2014:2267–8. http://ieeexplore.ieee.org/document/6958049/
  24. Wang Q, Liu W, Wang X, Chen X, Chen G, Wu Q. GMHANN: A Novel Traffic Flow Prediction Method for Transportation Management Based on Spatial-Temporal Graph Modeling. IEEE Trans Intell Transport Syst. 2024;25(1):386–401.
  25. Hochreiter S, Schmidhuber J. Long Short-Term Memory. Neural Computation. 1997;9(8):1735–80.
  26. Zhang W, Yu Y, Qi Y, Shu F, Wang Y. Short-term traffic flow prediction based on spatio-temporal analysis and CNN deep learning. Transportmetrica A: Transport Science. 2019;15(2):1688–711.
  27. Liu L, Wu M, Lv Q, Liu H, Wang Y. CCNN-former: Combining convolutional neural network and Transformer for image-based traffic time series prediction. Expert Systems with Applications. 2025;268:126146.
  28. Sattarzadeh AR, Pathirana PN, Kutadinata R, Huynh VT. Extracting long-term spatiotemporal characteristics of traffic flow using attention-based convolutional transformer. IET Intelligent Trans Sys. 2023;18(10):1797–814.
  29. Wang X, Jiang H, Zeng T, Dong Y. An adaptive fused domain-cycling variational generative adversarial network for machine fault diagnosis under data scarcity. Information Fusion. 2026;126:103616.
  30. Yan J, Cheng Y, Zhang F, Li M, Zhou N, Jin B. Research on multimodal techniques for arc detection in railway systems with limited data. Structural Health Monitoring. 2025.
  31. Chen Y, Zhang Q, Yu F. Transforming traffic accident investigations: a virtual-real-fusion framework for intelligent 3D traffic accident reconstruction. Complex Intell Syst. 2025;11(1):76.
  32. Chen Y, Yu F, Chen L, Jin G, Zhang Q. Predictive modeling and multi-objective optimization of magnetic core loss with activation function flexibly selected Kolmogorov-Arnold networks. Energy. 2025;334:137730.
  33. Yu B, Yin H, Zhu Z. Spatio-Temporal Graph Convolutional Networks: A Deep Learning Framework for Traffic Forecasting. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence; 2018:3634–40.
  34. Chen Y, Zheng L, Tan Z. Roadside LiDAR placement for cooperative traffic detection by a novel chance constrained stochastic simulation optimization approach. Transportation Research Part C: Emerging Technologies. 2024;167:104838.
  35. Hu X, Tan L, Tang T. M2BIST-SPNet: RUL prediction for railway signaling electromechanical devices. J Supercomput. 2024;80(12):16744–74.
  36. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN. Attention Is All You Need. arXiv. 2023. http://arxiv.org/abs/1706.03762
  37. Wu H, Xu J, Wang J, Long M. Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. arXiv. 2022. http://arxiv.org/abs/2106.13008
  38. Nie Y, Nguyen NH, Sinthong P, Kalagnanam J. A time series is worth 64 words: long-term forecasting with transformers. 2023.
  39. Liu Y, Hu T, Zhang H, Wu H, Wang S, Ma L, et al. iTransformer: Inverted Transformers Are Effective for Time Series Forecasting. arXiv. 2024. http://arxiv.org/abs/2310.06625