
Adaptive traffic signal control using deep reinforcement learning: Toward smarter and safer urban mobility

  • Fayez Alanazi,

    Roles Conceptualization, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Civil Engineering Department, College of Engineering, Jouf University, Sakaka, Saudi Arabia

  • Ammar Armghan ,

    Roles Conceptualization, Investigation, Methodology, Project administration, Visualization, Writing – original draft, Writing – review & editing

    aarmghan@ju.edu.sa

    Affiliation Department of Electrical Engineering, College of Engineering, Jouf University, Sakaka, Saudi Arabia

  • Muhammad Tanveer,

    Roles Investigation, Methodology, Validation, Visualization, Writing – original draft

    Affiliation School of Systems and Technology (SST), University of Management and Technology, Lahore, Pakistan

  • Amr Yousef

    Roles Conceptualization, Formal analysis, Investigation, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliations Electrical Engineering Department, College of Engineering, University of Business and Technology, Jeddah, Saudi Arabia, Engineering Mathematics Department, Faculty of Engineering, Alexandria University, Alexandria, Egypt

Abstract

In today’s rapidly evolving Intelligent Transportation Systems (ITS), traditional traffic signal control systems are often inadequate for optimizing real-time traffic flow because they depend on preset schedules and cannot adapt to dynamically changing traffic conditions. These systems cannot adjust signal timing dynamically, especially across multiple intersections, resulting in inefficient vehicle flow, longer queues, and higher levels of congestion. Thus, the need arises to develop intelligent systems capable of optimizing traffic flow in real time, reducing delays, and addressing the growing challenges of intelligent transportation. To address these requirements, a novel deep reinforcement learning framework that combines the Twin Delayed Deep Deterministic Policy Gradient (TD3) with prioritization-based Intelligent Traffic Control (P-ITC) is proposed for real-time traffic signal optimization. The framework builds on TD3’s stability-enhancing techniques, including clipped double Q-learning, delayed policy updates, and target policy smoothing, ensuring robust signal timing decisions across intersection networks. Prioritized Experience Replay (PER) prioritizes critical traffic signal experiences, ensuring the system learns from key events that influence real-time traffic flow. The proposed TD3P-ITC framework achieves maximum reductions in queue length (up to 22% at transport hub intersections and 25% at highways) and a 17.9 percent decrease (compared to baseline approaches) in simulated accident rates.

1. Introduction

Intelligent Transportation Systems (ITS) are becoming increasingly vital in addressing the growing challenges of vehicle congestion in urban environments [1]. Traffic signal control plays a crucial role in enhancing vehicle flow and mitigating queues at intersections [2]. Traditional frameworks fail to adapt signal timing dynamically, thus requiring an intelligent approach to optimization [3]. In recent years, adaptive optimal control of traffic signals has attracted considerable attention in ITS, with reinforcement learning (RL) emerging as a means to enhance traffic management efficiency and intelligence [4]. A key goal of ITS is to reduce queue length and increase traffic throughput, with traffic signal control systems playing a vital role in optimizing traffic management at intersections [5]. Recent advances in deep reinforcement learning (DRL) show significant potential to address the challenges of adaptive traffic signal control by learning optimal control policies from real-time traffic data [6]. With growing urban traffic congestion, DRL dynamically optimizes signal timings, thereby enhancing the efficiency of ITS [7]. For traffic signal control, the action-space design handles communication delays, thereby ensuring adaptability and scalability for real-time deployment in ITS [8]. A DRL-based real-time traffic analysis is employed, integrating real-time driving-style attributes to optimize traffic flow and reduce computational requirements [9]. Building on this approach, the framework incorporates real-time passenger occupancy data into transit signal priority, further enhancing its ability to minimize delay while optimizing vehicle flow efficiency [10].

RL-based traffic signal methods can efficiently learn from data but often require initial training with simulated data and subsequent fine-tuning with real-world inputs for optimal performance [11]. By continuously updating Q-values while forecasting future traffic conditions, the agent adapts and optimizes signal control strategies in real time, ultimately enhancing intersection efficiency in ITS [12]. In DRL-based traffic signal control, the agent observes the traffic environment and makes decisions through either discrete or continuous action spaces to optimize signal control [13]. By integrating policy updates in mixed-traffic environments, where both connected and non-connected vehicles coexist, one approach outperforms others by achieving the minimum average delay [14]. Another focuses on training stability and faster convergence through the use of Double Deep Q-Networks (DDQN) and PER in the agent environment [15]. By incorporating expert guidance, DRL enhances the adaptability of traffic management in adaptive multiple-intersection environments [16]. Distributed agent-based DRL focuses on large-scale traffic signal control, leveraging the potential of multi-agent systems to optimize traffic flow across urban areas [17]. Due to rising urban populations and increasingly congested roadways, there is a demand for ITS that uses real-time data and advanced algorithms to optimize traffic signal control, improve mobility, and mitigate the environmental impacts of congestion. The novelty of the TD3P-ITC framework lies in integrating DRL and PER to optimize traffic signal decisions. The integration of stability-enhancing techniques, such as target smoothing, ensures reliable decision-making. The key contributions of this research include:

  • Designing a TD3P-ITC system that optimizes signal timings across multiple intersections simultaneously, ensuring smooth traffic flow and reduced delays in urban networks.
  • Applying PER to focus learning on critical traffic scenarios, improving learning efficiency and accelerating optimal decision-making in control.
  • Integrating delayed policy updates and target smoothing to reduce overestimation bias, ensuring stable and reliable signal control in dynamic traffic conditions.
  • Balancing key goals such as minimizing wait time, maximizing throughput, reducing congestion, and ensuring safety.

2. Recent works

Fereidooni et al. [18] utilize DRL algorithms, specifically DQN for SADRL and PPO for MADRL, as well as a real-time actuated control system (SMART) integrated with SUMO’s TraCI protocol. The dataset, sourced from the Snap4City platform, contains actual traffic data from Florence, Italy, under normal, medium, and heavy congestion conditions. With SMART achieving the best tram/BRT prioritization and the lowest Mean Travel Time (MTT), the results demonstrate that DRL-based algorithms perform better than Webster, SUMO, and MaMoTLO. The computing cost of real-time models like SMART is considerable, and there has been limited study of hybrid models that combine fixed-cycle efficiency with actuation adaptability. Another area where research is lacking is in scaling to larger urban networks.

Li et al. [19] presented a Federated PPO-based algorithm for intelligent traffic signal control spanning multiple domains, enabling secure, distributed joint training across many intersections. The dataset includes simulated traffic flow data from common intersections. Compared with individual PPO, the results demonstrate a 27.34% decrease in vehicle waiting time and a 47.69% faster convergence rate. However, there is a lack of investigation into its practical application in fast-paced urban settings, as well as a corresponding lack of information about intersections. To improve junction modeling, Wang et al. [20] proposed a multi-layered graph-masking-based learning approach for adaptive traffic signal control (ATSC) with many intersections, combining upper and lower graph structures. The dataset encompasses both real-world and synthetic urban road networks. Results demonstrate excellent scalability and generalizability, surpassing current approaches in traffic optimization and delay reduction. Incorporating pedestrian data and improving graph layers, such as lane graphs, for more detailed modeling will be the focus of future work, while handling intersections with more than four directions remains an unexplored area of research.

To improve both the length and phase selection of traffic signal control, Bouktif et al. [21] presented the MP-DQN framework, a parameterized DRL architecture. Results demonstrated a 33% improvement in travel time compared to traditional methods and a 7.5% improvement compared to GA and PSO approaches in the dataset, which was derived from simulations conducted in the SUMO environment. However, there are few studies on scaling the MP-DQN model for real-time deployment in large, multi-intersection networks, due to its increased computational complexity.

Zhang et al. [22] achieved a balance between efficiency, safety, and environmental effect by proposing a new method for real-time adaptive traffic signal regulation based on DRL. The methodology outperforms existing methods by employing the Dueling Double Deep Q-Network (D3QN), resulting in a 16% reduction in vehicle congestion. The trade-off demonstrates the advantages of optimizing numerous targets, despite a slight increase in waiting time (0.64%). The framework’s contributions to smarter, more sustainable traffic control solutions are particularly evident in high-traffic situations, where it delivers significant advantages. Cao et al. [23] employed SUMO for simulated urban traffic and presented an optimization strategy for traffic signals based on the Deep Q-Network (DQN) algorithm. Compared with DQN and A2C, G-DQN achieves better results in terms of vehicle queue length and waiting time, especially during peak hours, due to its enhanced traffic state representation and network convergence. Future research should extend deep reinforcement learning algorithms, such as A3C and TD3, to handle complex traffic situations and address multi-intersection scenarios, as this study only covers single-intersection models.

Cai & Wei [24] proposed an improved technique for controlling traffic signals based on DRL, incorporating a Dueling Double Deep Q-learning Network (D3DQN), a noise network, and prioritized experience replay (PER) to enhance the model and accelerate convergence. Significant improvements in queue length and waiting time, and faster convergence, were observed using a dataset comprising simulated traffic flow data from several intersections. However, there is a lack of studies on how to adapt the strategy for multi-agent collaborative control in real-world applications and on addressing the cold-start problem during early training. To address scalability and intersectional variability, Bao et al. [25] proposed Federated Learning-based Reinforcement Learning (FL-RL). The method incorporates local agent information into a global model. With real-world traffic data from Monaco included in the dataset, the authors report significant improvements in traffic flow efficiency and a 64.48% reduction in total waiting time. Regarding model performance and adaptability in varied urban traffic situations, there is a lack of research on sophisticated aggregation methods and meta-learning techniques.

An adaptive traffic control system, the M2SAC framework, was introduced by Zhang et al. [26], which uses multi-agent-based masking of the soft actor-critic model to optimize signal light timing. Compared with baseline approaches and more conventional models, such as Webster’s formula, evaluation on a dataset of real-world traffic scenarios from Melbourne, Australia, demonstrated an improvement of 5.17%. The model’s scalability in larger, more complex metropolitan settings, as well as the computational challenges of deploying it in real time under dynamic traffic conditions, warrant further study. Yazdani et al. [27] expanded the reward function to account for road user interactions and optimized automobile and pedestrian traffic flow using a deep RL technique with Double Deep Q-Networks (DDQN). Using actual traffic data from the SCATS system, the approach improves vehicle travel time by 9% and reduces total user delays by 5%, particularly in scenarios involving high pedestrian volumes (Table 1).

Table 1. Summary of related works on traffic control features.

https://doi.org/10.1371/journal.pone.0339207.t001

Zhao et al. [28] proposed a signal control optimization model for overflow prevention during nonrecurrent congestion, when the traffic volume and the bottleneck capacity are unknown, as in the case of traffic accidents. The approach anticipates available space in the remaining lanes of exit and arrival-departure curves based on partial vehicle connectivity information. It recalculates signal timing using a model predictive control system. Case studies show that overflow prevention is highly effective when connected vehicle penetration rates exceed 10 percent. The average delay reduction is 48.56 percent and 24.49 percent relative to adaptive signal control approaches that did not consider exit-lane conditions and those that did not consider a predictive model, respectively. A collaborative model that determines the quantity, width, type, distribution, and signal settings of the separated junctions to maximize capacity was introduced by Shi et al. [29] as a lane layout design and signal optimization model. Using modified saturation flow models, the method establishes relationships among three lane types: conventional-width lanes (CWL), special-width approach lanes (SWAL), and dedicated passenger-car lanes (DPCL). An optimization framework for mixed-integer quadratic programming is developed and tested, with road width as the primary input, using a branch-and-bound technique. With more adaptability and robustness for traffic groups of varying widths and structures, the numerical evidence shows that dedicated passenger car lanes and special-width lanes improve intersection capacity by an average of 12.51%.

K. Jalil et al. [30] provide a comprehensive review of collision avoidance (CA) devices in internet-connected vehicles (ICVs), focusing on sensor-based perception, communication technologies, and data-driven AI to enable real-time optimization. The paper also critically examines various CA techniques and their effectiveness in identifying and preventing static and dynamic barriers, as well as in preventing contact with other road users. Focusing on the value of integrated control systems, the review underlines the benefits of these technologies in enhancing vehicle performance, improving network-wide traffic efficiency, and preventing accidents, based on recent research and peer-reviewed materials. Current research on traffic signal control systems using DRL has several major limitations: such systems fail to achieve improved performance in real-time traffic networks, particularly at multiple intersections and with complex traffic signal phases. Numerous studies, such as Fereidooni et al. [18] and Li et al. [19], have demonstrated the efficacy of DRL models; however, they have not comprehensively addressed scalability and real-time implementation. There is also a significant lack of research on hybrid models that combine fixed-cycle efficiency with adaptive actuation. The suggested model addresses these problems by combining scalable DRL algorithms that enable real-time adaptation, making the approach more generalizable across complex metropolitan networks. The proposed TD3P-ITC model addresses significant deficiencies identified in current research, including inefficient traffic flow, lengthy vehicle queues, varying weather conditions across extensive networks, elevated computational costs in real-time systems, and insufficient generalizability across multi-intersection configurations. The model is designed to perform well across urban networks by utilizing hybrid DRL approaches and offering real-time flexibility. This improves traffic flow and reduces delays.
The model can handle complex traffic situations and provide a more long-term solution for adaptive traffic signal control by leveraging advanced learning techniques.

3. The proposed methodology

3.1. Problem formulation

Consider a real-time urban traffic network consisting of N = 5 intersections (residential, commercial, highway, mixed-use, and transport hub), labeled {1, 2, 3, 4, 5}, where each intersection observes traffic conditions through 12 distinct features sampled at 1-minute intervals over 2000 observations. These features include normalized traffic volume, inferred queue length, one-hot encoded signal phase (red, green, yellow), and weather conditions. The objective of this framework is to optimize traffic signal control decisions across a multi-intersection network: minimize the average vehicle waiting time, maximize throughput efficiency, minimize congestion, and ensure safety. The model aims to adapt in real time to dynamic traffic conditions, weather variations, and emergency vehicle prioritization, ensuring efficient traffic flow, reduced delays, and optimized safety at intersections. The objective function is given in Equation (1).

(1)

Subject to:

The proposed TD3P-ITC addresses the problem by using DRL and applies PER to prioritize critical traffic events, ensuring effective learning from high-volume traffic scenarios. The safety constraint is achieved by penalizing accident-prone scenarios and by adapting signal phases to reduce risk.
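The multi-objective goal described above can be sketched as a single scalar cost. The functional form and the weights `w1`–`w4` below are assumptions for illustration only; the paper's exact objective is given in Equation (1).

```python
import numpy as np

def network_cost(wait_times, throughput, queue_lengths, accident_risk,
                 w1=1.0, w2=1.0, w3=1.0, w4=1.0):
    """Illustrative multi-objective cost over the N intersections.

    The weights w1..w4 and the exact functional form are assumptions;
    the paper's objective (Eq. 1) balances the same four terms.
    """
    return float(w1 * np.mean(wait_times)       # minimize average waiting time
                 - w2 * throughput              # maximize throughput (negated)
                 + w3 * np.mean(queue_lengths)  # minimize congestion
                 + w4 * accident_risk)          # penalize accident risk

# Two-intersection example: waits (s), vehicles served, queue lengths, risk
cost = network_cost(np.array([30.0, 45.0]), 120.0, np.array([5.0, 8.0]), 0.1)
```

Lower values of this cost correspond to better joint performance; a DRL agent minimizing it trades waiting time and congestion against throughput and safety.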

As shown in Fig 1, the TD3P-ITC framework controls traffic signals in intelligent transportation systems in real-time. It receives time-stamped vehicle counts, speeds, vehicle types, queue lengths, signal phases, and weather conditions. Normalized and pre-processed characteristics comprise the model’s decision-state space. A reward function is used to minimize congestion, maximize throughput, and ensure safety, with reward functions for accidents and traffic congestion. The action space includes signal phase decisions and their corresponding signal durations, which the TD3P algorithm optimizes based on traffic experiences. The model uses PER to prioritize essential traffic events for efficient learning and dynamically adjusts traffic lights in real time to enhance traffic flow across multiple intersections. The signal phase decisions for each intersection point are allotted and optimized through a policy update. The updated green-light duration in the NS and EW directions ensures balanced traffic flow and minimizes congestion.

3.2. Materials and methods

The dataset used for this research is a real-time urban traffic dataset from multiple intersections comprising 2000 observations and 12 distinct features [31]. The features are traffic metrics such as time-stamp, location ID (1–5) sampled at one-minute time intervals, traffic volume, vehicle counts for different types (cars, trucks, bikes), and environmental factors such as weather conditions (sunny, rainy, foggy, cloudy, and windy), temperature, and humidity. Also, the data source provides accident reports and the current traffic signal status (red, green, yellow). The Smart Traffic Management Dataset is an open-source dataset for intelligent transportation and traffic control, containing 2,000 time-stamped traffic observations with 12 variables per case. With aggregated observations of city-wide traffic sensors rather than data from a specific deployed city, the dataset provides a synthetic, open-source environment for research into traffic analytics and machine learning. There is no particular location associated with the dataset; thus, it is geographically generic.

The data is not indexed by absolute calendar dates but is a time-indexed traffic record intended to capture representative changes in daily traffic, such as peak and off-peak periods. In this work, we first pre-process and transform raw traffic variables, including vehicle counts, average velocity, signal state, vehicle type distributions, and environmental variables, into a 12-dimensional state representation per intersection. From there, we derive traffic volume, queue length, and stop count from vehicle density and a low-speed threshold, inspect sensor data to obtain average velocity and speed variance directly, one-hot-encode signal phases, represent vehicle composition as proportional ratios, and numerically code and normalize weather, temperature, and humidity. The dataset does not contain any past accident records. However, the accident indicators needed for this work, driven by traffic density, speed variation, signal conflicts, and bad weather, are modeled in the SUMO environment, which is used solely to create and assess safety-aware rewards. Simulated safety incidents in the SUMO environment are activated by traffic congestion, velocity variance, signal crossing opportunities, and poor weather. Instead of simulating historical crash causality, these simulated accident indicators offer controlled, reproducible risk-minimization measurement through safety-conscious incentive shaping and performance evaluation.

3.2.1. Pre-processing & traffic feature analysis.

The raw traffic signal dataset undergoes several pre-processing steps to make it suitable for training the TD3P-ITC system. Time-based features, such as hour and day, are extracted from the time-stamp. For feature scaling, min-max normalization is applied to ensure uniformity across different metrics, and z-score standardization is adaptively applied for zero mean and unit variance. Feature extraction includes normalizing traffic volume, with queue length calculated from traffic volume and average vehicle speed. Figs 2a–2d show important aspects of traffic management that utilize advanced analytics of traffic features. The correlation heatmap shows the relationships among key traffic parameters, including vehicle counts, average speed, and weather conditions. These features are important for constructing an optimal state space for signal management systems. The accident incidence heatmap examines accident rates by signal status and time, demonstrating how changes in signal phase affect accident frequency. The combined subplots also provide a comprehensive view of traffic patterns: one chart shows the average speed of vehicles in different weather conditions, while another displays the number of vehicles on the road at various times of day. Together, these assessments help plan real-time adaptive traffic signal control, thereby enhancing safety and improving traffic flow. The research emphasizes the necessity of dynamic signal adjustments.
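The scaling and queue-inference steps above can be sketched as follows. The low-speed threshold used in `infer_queue` is a hypothetical value; the paper infers queues from volume and speed but does not state the exact rule.

```python
import numpy as np

def minmax(x):
    """Min-max normalization to [0, 1]."""
    rng = x.max() - x.min()
    return (x - x.min()) / rng if rng > 0 else np.zeros_like(x, dtype=float)

def zscore(x):
    """Z-score standardization: zero mean, unit variance."""
    sd = x.std()
    return (x - x.mean()) / sd if sd > 0 else np.zeros_like(x, dtype=float)

def infer_queue(volume, avg_speed, speed_threshold=5.0):
    """Hypothetical queue proxy: count vehicles as queued where the
    average speed falls below a low-speed threshold (value assumed)."""
    return np.where(avg_speed < speed_threshold, volume, 0.0)

volume = np.array([120.0, 80.0, 200.0, 150.0])   # vehicles per interval
speed = np.array([3.0, 20.0, 4.5, 30.0])         # km/h
v_norm = minmax(volume)
queue = infer_queue(volume, speed)
```

Min-max scaling keeps bounded features comparable across intersections, while z-score standardization suits features with varying ranges such as temperature.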

Fig 2. Traffic Environment Analysis a) Feature Correlation b) Accident Rate c) Average Speed Across Hours and d) Traffic Volume.

https://doi.org/10.1371/journal.pone.0339207.g002

3.2.2. State space model.

The state space gives the current condition of the traffic environment at time t, including the key features that define the flow of traffic and the environment at each intersection. The state space is composed of the following elements, as given in Equations (2) and (3):

(2)

In contrast to the overall network state, Equation (2) describes the per-intersection state. Each intersection's 12-dimensional feature vector includes factors such as ambient variables, vehicle composition, average speed, predicted wait time, and traffic volume. These vectors, one for each of the N = 5 intersections in the network, together form the global state. In this scenario, the global state vector has 60 elements, and the overall state dimension is 12N.

(3)

Here, the normalized traffic volume acts as a primary indicator of traffic congestion, and a high volume directly contributes to a higher negative reward, incentivizing the agent to clear traffic. The queue length is inferred from vehicle speed and count; the signal phase is one-hot encoded; and the one-hot encoded weather captures conditions that impact traffic flow, since rain reduces speed and fog increases following distance, alongside scaled temperature and scaled humidity. Table 2 summarizes the key hyperparameters used in the experiments to train the proposed TD3P-ITC framework, including network learning rate, exploration rate, prioritized replay, reward weighting, and convergence.

The chosen features balance control relevance, observability, and learning stability. Optimizing signal timing requires fundamental congestion indicators such as speed, queue length, and traffic volume. Sensitivity to stop-and-go oscillations and safety risk can be represented by stop events and speed variance, eliminating the need to model actual collisions. Signal-phase encoding maintains deterministic phase control while also providing context-aware decisions. To estimate the effects of external disturbances on capacity and accident probability, we use environmental factors and vehicle composition to capture heterogeneous discharge behavior. Pedestrian movement, turning ratios, and lane changes are not part of this model because they are either not included in the provided dataset, are highly intersection-specific, or would significantly increase the state's dimensionality without being uniformly observable. Excluding them keeps the state representation compact, scalable, and sensor-realistic for continuous-control reinforcement learning.
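Assembling the state described in this subsection can be sketched as below. The exact feature ordering and the vehicle-composition feature are assumptions; the paper specifies 12 features per intersection, one-hot signal phases, and a 12N-dimensional global state.

```python
import numpy as np

# One-hot phase encodings, as in Sec. 3.2.2; feature ordering is assumed.
PHASE = {"red": (1.0, 0.0, 0.0), "green": (0.0, 1.0, 0.0), "yellow": (0.0, 0.0, 1.0)}

def intersection_state(volume, queue, stops, speed, speed_var,
                       phase, weather, temp, humidity, truck_ratio):
    """Assemble one 12-dimensional per-intersection state vector (Eq. 2 sketch)."""
    return np.array([volume, queue, stops, speed, speed_var,
                     *PHASE[phase], weather, temp, humidity, truck_ratio])

def global_state(states):
    """Concatenate N per-intersection vectors into the 12*N global state (Eq. 3)."""
    return np.concatenate(states)

s1 = intersection_state(0.6, 0.4, 0.2, 0.5, 0.1, "green", 0.0, 0.7, 0.5, 0.1)
g = global_state([s1] * 5)   # N = 5 intersections -> 60-dimensional state
```

With N = 5, the concatenated global state has the 60 elements noted above.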

3.2.3. Action vector representation.

The action function represents traffic signal control decisions based on current traffic signal phases, including red, green, and yellow, along with the duration of green lights across multiple intersections. It is represented as a vector containing the green and yellow durations for the North-South (NS) and East-West (EW) directions. The agent learns to choose the optimal signal phase or timing to minimize congestion and waiting times. The inclusion of signal phases and green light durations is formalized in Equation (4). The time-varying action is determined by adjusting the duration of the continuous green phase at time t based on the current signal phase state.

(4)

Here, the signal phase for the North-South direction can be red, green, or yellow, and similarly for the East-West direction. The green durations are given for the North-South and East-West directions, bounded by a minimum green light duration and a maximum green light duration to prevent excessive delays at the opposing approach. The action space manages the length of the currently active green stage for North-South or East-West movement without optimizing discrete phase transitions. To ensure traffic safety, the signal controller enforces yellow phases and all-red clearance intervals automatically during all phase transitions. Each green-to-red transition begins with a yellow interval and complete red clearance before the opposing direction is activated. The green time bounds are set based on traffic engineering practice and network saturation to prevent dangerous switching or severe starvation. All experiments limit the agent's continuous action outputs to this permissible range before applying the signal timing controller. Yellow and all-red phase safety restrictions are set outside the learning loop and cannot be violated by the policy. The two green durations mitigate congestion and maximize throughput while considering the traffic volume at each intersection. The critic estimates the expected penalty for taking an action in a given state, and the agent selects the timing that yields the best trade-off between traffic flow optimization and safety. The proposed agent does not perform discrete phase selection; it selects only durations. External factors drive the establishment and imposition of signal phase sequencing to ensure compliance with safety and regulatory standards. Only actions constrained to the length of the running green stage are considered by the TD3 policy as part of the continuous action space.
There is no discretization or mapping of continuous actions to phase decisions, as the continuous outputs are fed directly into the timing control variables. This ensures policy stability and compatibility with continuous-control reinforcement learning.

However, the TD3 control strategy does not learn discrete phase selection; it learns only temporal modifications of continuous values. To ensure safety and practicality, a predetermined operating sequence controls the signal phases, and the actor network generates continuous control signals that indicate whether the active green interval should be extended or shortened. To ensure full compatibility with continuous-action reinforcement learning, the phase terms in the action formulation refer to the current context of active control rather than policy-optimized choice variables. The TD3P-ITC performs action updates to adjust the green timings dynamically: if the queue at lane 1 exceeds a threshold, the green time for lane 1 is increased; if throughput for lane 2 is high, its green time is reduced to avoid congestion in other lanes.
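The bounded-duration action handling described above can be sketched as a simple clipping step. The numeric bounds `G_MIN` and `G_MAX` are assumed values; the paper enforces minimum and maximum green durations but does not state the figures used here.

```python
import numpy as np

# Assumed bounds; the paper enforces min/max green durations but does not
# state the numeric values sketched here.
G_MIN, G_MAX = 10.0, 60.0   # seconds

def apply_action(raw_action):
    """Clip the actor's continuous outputs (NS and EW green durations) into
    the permissible range before handing them to the timing controller.
    Yellow and all-red clearance phases are enforced outside the policy."""
    g_ns, g_ew = np.clip(raw_action, G_MIN, G_MAX)
    return {"green_NS": float(g_ns), "green_EW": float(g_ew)}

timings = apply_action(np.array([72.3, -5.0]))   # out-of-range actor outputs
```

Because clipping happens before the controller, the learned policy can never request an unsafe green duration, mirroring the safety restrictions set outside the learning loop.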

3.2.4. Reward function.

The reward function helps guide the DRL model in making decisions that optimize traffic flow, including minimizing queue lengths, maximizing throughput, and preventing accidents. This reward function ensures that the agent not only optimizes traffic flow but also prioritizes safety by discouraging accidents and reducing congestion at intersections in the traffic environment given in Equation (5).

(5)

Here, the queue length and stop count are inferred from vehicle speed and count, and an accident indicator is reported at each time step. The reward depends on the queue length at each intersection at time t, the traffic volume at each intersection at time t, and a maximum acceptable number of stops. The stop penalty function uses the number of stopped vehicles at each intersection at time t and the average vehicle speed. The accident term represents a risk proxy, not the output of a trained probabilistic classifier. It is deterministically derived using normalized queue length, mean speed variation, traffic volume, weather condition indicators, and the current signal phase at time t. The accident risk threshold determines the level of risk tolerance for safety penalties and is a non-learnable hyperparameter. The proxy uses observable traffic and environmental factors, such as normalized queue length, traffic density, speed fluctuations, weather conditions, and signal phase conflicts, to produce a closed-form risk measure. These features are linearly combined and passed through a sigmoid function to generate a normalized value in [0, 1] representing short-term accident risk, not a learned probability model.

The incentive weights (α, the queue penalty weight; β, the stop penalty weight; γ, the safety/accident weight) were determined through empirical sensitivity analysis rather than heuristic estimates. Accounting for congestion reduction, stop minimization, and safety penalties, a coarse grid search over permissible ranges was refined based on training stability and convergence rate with respect to policy strength. The chosen weights steadily improved performance without causing oscillatory signals or safety breaches. Once identified, the weights were held constant across all intersections, traffic demand levels, and weather conditions to ensure fair comparison and demonstrate the policy's generalizability. No per-intersection or per-scenario reward weighting was used during training or evaluation.
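The reward and sigmoid risk proxy described above can be sketched as follows. The feature weights, bias, and the values of α, β, γ, and the risk threshold below are illustrative assumptions, not the tuned settings from the paper's sensitivity analysis.

```python
import math

def accident_risk(queue_n, speed_var, volume_n, bad_weather, phase_conflict,
                  weights=(1.2, 1.0, 0.8, 0.6, 1.5), bias=-2.0):
    """Closed-form risk proxy: sigmoid of a linear combination of normalized
    features, as described in Sec. 3.2.4. Weights and bias are illustrative."""
    z = bias + sum(w * x for w, x in zip(
        weights, (queue_n, speed_var, volume_n, bad_weather, phase_conflict)))
    return 1.0 / (1.0 + math.exp(-z))

def reward(queue_n, stops_n, risk, alpha=1.0, beta=0.5, gamma=2.0,
           risk_threshold=0.7):
    """Sketch of the Eq. (5) reward: penalize queues and stops, with an
    extra safety penalty when the risk proxy exceeds the threshold.
    All weight values here are assumptions, not the paper's settings."""
    r = -(alpha * queue_n + beta * stops_n)
    if risk > risk_threshold:   # non-learnable safety threshold
        r -= gamma * risk
    return r

risk = accident_risk(0.9, 0.8, 0.9, 1.0, 1.0)   # congested, rainy, conflicting
r = reward(queue_n=0.9, stops_n=0.5, risk=risk)
```

In this congested, high-risk example the safety term activates, producing a strongly negative reward that pushes the agent away from accident-prone signal decisions.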

Fig 3a-3c reports that the accident indicator acts as a primary reward signal and safety indicator for the PER mechanism. A value of 1 triggers a large negative reward, ensuring that the agent quickly learns to avoid accident-prone scenarios. The signal status from the agent's outputs determines the next signal phase, as transitioning from red to green indicates control over the environment. In this simulation, the agent adjusts the signal phases at a traffic intersection based on weather conditions, accident reports, and traffic volume. The three plots show how the agent's choices affect traffic over time. First, traffic volume varies with weather conditions and signal phases, peaking during rush hours, and accidents exacerbate congestion. Second, the average speed follows a similar trend, slowing when there are accidents or adverse weather conditions, such as rain or fog. Finally, the reward function indicates how effectively the agent achieved these objectives: higher rewards mean better traffic management and fewer accidents.

Fig 3. Traffic Control Analysis Over Time Across a) Traffic Volume, b) Average Speed, and c) Reward Function.

https://doi.org/10.1371/journal.pone.0339207.g003

TD3 trains a deterministic policy in an off-policy manner: the agent explores by adding noise to its actions while learning from transitions stored in a replay buffer, with updates driven by multiple learning signals. The TD3 framework uses double Q-functions trained via mean-squared Bellman error minimization, which differs from other learning models in how the target for each action value is computed. Adding clipped noise to each dimension of the target action regularizes the value estimates and ensures that the exploited target actions remain plausible. The target actions follow stability techniques that keep them within a valid action range and yield corresponding reward updates.

3.2.5. Clipped double Q-learning.

The traffic signal control model is implemented by combining clipped double Q-learning and TD3 to optimize the agent's decision-making, minimizing bias in Q-value estimates and stabilizing learning across traffic states during intersection analysis. TD3 utilizes two Q-functions, Q_θ1 and Q_θ2, and evaluates the minimum of the two at each decision point. The target Q-value is computed for the sequential transition tuple (s, a, r, s′, d), with initial state s, executed action a, immediate reward r, γ as the discount factor, subsequent state s′, and d as a binary indicator for episode termination. The Q-value target is computed as in Equation (6),

y = r + γ(1 − d) · min_{j=1,2} Q_θ′j(s′, ã)  (6)

Target policy smoothing ensures smoother policy updates and prevents overfitting to Q-values, which might lead to instability. With the gradual update of the target network, the actor's action decisions, for instance the green duration for signals, remain stable and avoid overreacting to transient traffic fluctuations; ã denotes the noisy target action. To prevent the policy from exploiting sharp, high peaks in the Q-function, it is expressed as in Equation (7),

ã = clip(μ_φ′(s′) + clip(ε, −c, c), a_low, a_high),  ε ~ N(0, σ)  (7)

The clipping function smoothes the decision-making process, helping the agent avoid overreacting to transient traffic changes. Here ε is Gaussian noise and c is a clipping bound; the clipped random noise smoothes and helps average out Q-value errors over similar actions. The term μ_φ′(s′) is the target action policy, based on traffic features like weather and crash type, evaluated at the next state s′. The bounds a_low and a_high applied to the actor network's output delimit the current signal-phase durations and traffic-risk policies, such as safety measures.
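A minimal sketch of Equations (6) and (7), with the target networks replaced by stand-in callables (an assumption for illustration); the noise scale, clipping bound, and green-duration limits are hypothetical values:

```python
import numpy as np

def smoothed_target_action(mu_target, s_next, sigma=0.2, c=0.5,
                           a_low=5.0, a_high=60.0):
    """a~ = clip(mu'(s') + clip(eps, -c, c), a_low, a_high), eps ~ N(0, sigma).
    a_low/a_high stand in for the signal-duration bounds (assumed values)."""
    base = np.asarray(mu_target(s_next), dtype=float)
    eps = np.clip(np.random.normal(0.0, sigma, size=base.shape), -c, c)
    return np.clip(base + eps, a_low, a_high)

def td3_target(r, d, q1_t, q2_t, s_next, a_tilde, gamma=0.99):
    """y = r + gamma * (1 - d) * min(Q1'(s', a~), Q2'(s', a~))."""
    return r + gamma * (1.0 - d) * min(q1_t(s_next, a_tilde),
                                       q2_t(s_next, a_tilde))
```

With constant critic stubs returning 10 and 8, the clipped target uses the smaller estimate, which is the overestimation-control mechanism TD3 relies on.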

Fig 4 illustrates the complete process of the TD3P-ITC framework, which is utilized to manage traffic signals in a manner that adapts to changing conditions. The first step is to generate traffic data and a state space from sensor inputs, including vehicle counts, queue length, and wait time. The TD3 agent uses a policy to choose actions that modify the timing of signals and then implements those changes. The experience collection phase collects new states and rewards, then places them in the PER buffer. After that, the agent’s experience is prioritized based on the TD error, and the critic and actor networks are then updated during training. The framework continually improves its signal control policy, which enables it to work more effectively over time.

3.2.6. Target network and delayed updates.

The target networks provide stability by slowly tracking the parameters of both actor and critic networks, generating consistent learning targets and preventing the moving-target problem of earlier RL models via the soft update mechanism. Along with delayed policy updates, this stability is enhanced by updating the actor network every d = 2 steps while the critics update continuously, allowing Q-value estimates to converge near their true values before policy modification. This plays an important role in traffic scenarios where premature signal changes can cascade through the network. The soft update rule for the target networks is given by Equation (8).

θ′j ← τθj + (1 − τ)θ′j (j = 1, 2),  φ′ ← τφ + (1 − τ)φ′  (8)

The parameters of the target policy network are φ′, and those of the current policy network are φ. The terms θ′1 and θ′2 are the parameters of the target Q-function networks Q1 and Q2, while θ1 and θ2 are the parameters of the current Q-function networks, with τ as the update factor controlling the update rate. A small τ means the update is gradual, ensuring stable learning. The target computation then takes the minimum Q-value from the two Q-functions, as given in Equation (9).

y = r + γ(1 − d) · min(Q_θ′1(s′, ã), Q_θ′2(s′, ã))  (9)

The term r is the safety-aware reward for safe driving behavior, γ is the discount factor controlling the importance of future rewards, and d is a flag indicating whether the traffic episode has concluded (1) or not (0). The minimum over the two Q-functions is used to evaluate traffic risks conservatively.
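The soft update of Equation (8) can be sketched with parameters as plain lists (a simplification; real implementations update tensors in place):

```python
def soft_update(target, online, tau=0.005):
    """theta' <- tau * theta + (1 - tau) * theta', element-wise.
    tau = 0.005 is an assumed illustrative value for the update factor."""
    return [tau * o + (1.0 - tau) * t for t, o in zip(target, online)]
```

Repeated application drags the target parameters slowly toward the online ones, which is what keeps the learning targets quasi-stationary.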

3.2.7. Loss function.

For each Q-function, calculate the predicted Q-values and their deviation from the observed target value y, which represents the traffic-risk estimate, then square the differences to obtain the error terms for both Q-functions. The loss function averages the squared errors over a mini-batch B drawn from the PER buffer, as given in Equations (10)–(12).

δ1 = y − Q_θ1(s, a)  (10)
δ2 = y − Q_θ2(s, a)  (11)
L(θj) = (1/|B|) Σ_{i∈B} w_i · δ_{j,i}²  (12)

The term B is the batch drawn from the PER buffer, which includes traffic incident data such as vehicle counts, crash severity, and road conditions. For each Q-function, the error term δj is computed using the Bellman formulation. The target value y incorporates the immediate reward and the minimum of the next state-action Q-pair. The error terms are then squared to penalize larger differences more heavily, and the loss for each Q-function is averaged over the batch. The final loss is minimized using gradient descent to drive the TD3P-ITC decisions, thereby improving the traffic signal control policy and estimating traffic risk through experience-based learning.
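A hedged NumPy sketch of Equations (10)–(12), treating the predicted Q-values and targets as precomputed arrays and applying the PER importance-sampling weights:

```python
import numpy as np

def critic_losses(y, q1, q2, is_weights):
    """Return (L1, L2) = mean_i w_i * (y_i - Qj(s_i, a_i))^2 for j = 1, 2.
    Inputs are batch arrays; in practice the Q-values come from the critics."""
    y, q1, q2, w = map(np.asarray, (y, q1, q2, is_weights))
    d1, d2 = y - q1, y - q2          # Bellman error terms (Eqs. 10-11)
    return (np.mean(w * d1 ** 2),    # weighted MSE for critic 1 (Eq. 12)
            np.mean(w * d2 ** 2))    # weighted MSE for critic 2
```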

3.2.8. Actor network phase.

The delayed actor-network update enables the Q-functions to converge closer to their true values before the policy is updated, leading to more stable learning. Here φ represents the policy-network parameters, η_φ the learning rate for the policy network, and J(φ) the objective function for the policy network; each primary network (policy and Q-function) is paired with a target network, as given in Equation (13).

φ ← φ + η_φ ∇_φ J(φ),  J(φ) = E_s[Q_θ1(s, μ_φ(s))]  (13)

Each primary network is replicated in a target network and updated incrementally, providing stable prediction targets during model training. The actor network takes the state s and outputs the action representation, which indicates the signal durations. The critic network provides the future-state value estimates for real-time traffic control.
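The delayed actor step of Equation (13) can be sketched as a plain gradient-ascent update applied only every d-th call; the gradient itself is assumed to be supplied by an autodiff framework in practice:

```python
def maybe_update_actor(step, phi, grad_j, eta=1e-4, d=2):
    """phi <- phi + eta * grad_phi J(phi), applied once per d critic updates.
    eta is an assumed learning rate; d = 2 follows the delayed-update setting."""
    if step % d == 0:
        return [p + eta * g for p, g in zip(phi, grad_j)]
    return list(phi)   # off-cycle step: actor parameters unchanged
```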

The input traffic environment shown in Fig 5 provides the state for multiple intersections with varying traffic densities. The critic network calculates the Q-values Q1 and Q2 through two separate estimations, which are then fed into the TD error to prioritize experiences in the replay buffer under high-volume traffic. The actor network generates the optimal signal-timing adjustments using current state-action pairs in association with the target updates θ′, φ′ and current policy updates θ, φ. The use of PER enhances learning efficiency by focusing on critical traffic events, and the target networks stabilize updates, allowing for accurate traffic signal optimization in dynamic environments.

Fig 5. Real-time traffic signal optimization across multiple intersections.

https://doi.org/10.1371/journal.pone.0339207.g005

The system minimizes waiting times, reduces congestion, and improves throughput across multiple intersections in real-time. The TD3P-ITC training framework is defined below.

Algorithm: TD3P-ITC Framework Training Procedure

Data Required: state space S

Output: Optimized policy μ_φ and Q-functions Q_θ1, Q_θ2

Initialize

1. Critic networks Q_θ1, Q_θ2 and actor network μ_φ

2. Target networks θ′1 ← θ1, θ′2 ← θ2, and φ′ ← φ

3. Replay buffer B, priorities p_i ← 1

4. Hyperparameters γ, τ, σ, c, d, α, β

5. for E = 1 to max_episodes do

6.  Initialize state s from traffic environment

7.  for t = 1 to T do

8.   Select action a = μ_φ(s) + exploration noise

9.   Execute a in the environment; update signal phases

10.   Observe reward r, next state s′, and done flag d

11.   Calculate TD-error δ using the current critics

12.   Store (s, a, r, s′, d) in B with priority (|δ| + ε)^α

13.   if |B| ≥ batch_size then

14.    Sample batch with PER probabilities P(i)

15.    for each (s, a, r, s′, d) in batch do

16.     Calculate smoothed target action ã

17.     Compute target y = r + γ(1 − d) min(Q_θ′1(s′, ã), Q_θ′2(s′, ã))

18.    end for

19.    Update critics by minimizing L(θ1) and L(θ2)

20.    Update priorities p_i from the new TD-errors

21.    if t mod d = 0 then

22.     Update actor: φ ← φ + η_φ ∇_φ J(φ)

23.     Soft update targets: θ′j ← τθj + (1 − τ)θ′j, φ′ ← τφ + (1 − τ)φ′

24.    end if

25.   end if

26.   s ← s′

27.  end for

28. end for

29. return μ_φ, Q_θ1, Q_θ2

3.2.9. Prioritization experience replay.

The use of the PER technique ensures that the most critical experiences, which significantly impact traffic flow, are replayed and used to update the learning model. The priority of an experience is based on the absolute temporal-difference error, p_i = |δ_i| + ε, where δ_i is the TD-error and ε is a small positive constant ensuring all experiences have a non-zero sampling probability. PER samples experiences based on priority; higher-priority experiences are those that significantly affect traffic flow and signal timings, and these are used to update the model. The TD error is calculated as in Equation (14).

δ = r + γ(1 − d) · min(Q_θ′1(s′, ã), Q_θ′2(s′, ã)) − Q_θj(s, a)  (14)

where δ guides the adjustment of signal timings to minimize delays and congestion while maximizing throughput, with Q_θ1 and Q_θ2 estimated by critic 1 and critic 2, respectively. The sampling probability for each experience is proportional to its priority, P(i) = p_i^α / Σ_k p_k^α, where α is a hyperparameter that controls the degree of bias towards higher-priority experiences. Using these prioritized samples enables the model to update its Q-values while focusing on the most critical traffic control experiences. This helps refine the learning process, especially for actions that have a significant impact on traffic flow, such as managing high congestion or adjusting signal timings during heavy traffic.

Priority is determined through proportional prioritization to implement Prioritized Experience Replay (PER). The priority of each transition is calculated as p_i = (|δ_i| + ε)^α, where δ_i is the temporal-difference error of the critic networks, ε = 10⁻⁶ prevents zero priorities, and α = 0.6 controls the degree of prioritization. The priorities are updated immediately after each learning update, using the newly calculated TD errors. Transitions are sampled based on P(i); to avoid the bias caused by uneven sampling, the critic's sample loss is corrected with importance-sampling weights w_i = (N · P(i))^(−β), where N is the size of the replay buffer. In practice, the exponent β of the correction term is linearly annealed from 0.4 to 1.0 during training, and the weights are normalized by max_j w_j to stabilize the optimization, which allows a reasonable balance between high-error transition exploration and unbiased value estimates.
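A compact sketch of the proportional prioritization described above, using the stated constants ε = 10⁻⁶ and α = 0.6 and the initial β = 0.4; the helper name and the array-based buffer (rather than a sum-tree) are illustrative assumptions:

```python
import numpy as np

def per_quantities(td_errors, alpha=0.6, eps=1e-6, beta=0.4, buffer_size=None):
    """Return sampling probabilities P(i) and max-normalized IS weights w_i
    for a set of TD errors: p_i = (|delta_i| + eps)^alpha,
    P(i) = p_i / sum_k p_k, w_i = (N * P(i))^(-beta) / max_j w_j."""
    delta = np.abs(np.asarray(td_errors, dtype=float))
    p = (delta + eps) ** alpha       # transition priorities
    probs = p / p.sum()              # sampling probabilities P(i)
    n = buffer_size or len(delta)
    w = (n * probs) ** (-beta)       # importance-sampling correction
    return probs, w / w.max()        # normalize by the max weight
```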

In Fig 6, PER accelerates learning by prioritizing critical traffic experiences with high temporal-difference errors, ensuring the system focuses on high-impact scenarios, such as emergency vehicle passages or congested areas, through probability-weighted sampling. This notable improvement in learning efficacy is achieved when employing PER within the TD3P-ITC framework. The TD3P-ITC with PER consistently outperforms the TD3 model without PER, as evidenced by the higher cumulative reward increase across training episodes. The shaded green region in Fig 6 illustrates the improvement achieved with PER, demonstrating its effectiveness in prioritizing experiences for enhanced learning. This study demonstrates the efficacy of PER-based reinforcement learning for controlling urban traffic signals. The cumulative reward for TD3P-ITC rises more quickly to higher levels, which means that policy optimization is more effective during training. Its practical significance lies in its ability to improve real-time traffic flow while ensuring safety, making it a strong candidate for future ITS applications.

In summary, the unique TD3P-ITC for real-time traffic signal optimization across multi-intersection networks provides an effective control of real-time traffic signal decisions. For stable learning in dynamic traffic situations, the system uses PER, which prioritizes essential traffic scenarios with high temporal-difference errors to efficiently use experience. The system employs dual normalization algorithms to handle a 12-feature state space, incorporating normalized traffic volume, inferred queue lengths, signal phases, and environmental conditions for context-aware decision-making across various traffic patterns. By balancing queue minimization, stop reduction, and accident prevention with weighted penalty terms (α, β, γ), a multi-objective reward function optimizes signal timing to achieve both safety and performance. Dynamic action-space controllers adjust the green signal phase based on queue thresholds and throughput measurements to regulate traffic flow within the modular architecture.

4. Results and discussion

The proposed TD3P-ITC research is conducted in a high-performance computing environment with multicore processors and GPUs for accelerated computation. Software tools include Python for data processing, along with TensorFlow and PyTorch. The simulation is performed using the SUMO traffic simulation platform for real-time modeling. The system is equipped with 64 GB of RAM and Nvidia RTX GPUs for efficient model training and evaluation. During training, the TD3P-ITC model's GPU memory usage never exceeded 6 GB, and the test run converged within 5,000 episodes in around 3.2 hours. To ensure reproducibility, all tests were conducted using fixed random seeds for the Python runtime, NumPy, and the deep learning libraries. The reported results reflect the average performance over many trials with different random seeds. Using realistic traffic demand distributions, vehicle composition, signal statuses, and environmental conditions, the Kaggle dataset is used to initialize and parameterize the traffic simulation environment. The reinforcement learning agent interacts with the SUMO microscopic simulator: rather than being taught from a static dataset, the agent acquires state transitions, rewards, and experiences through online interaction with its environment.

The SUMO microscopic simulator implemented a traffic simulation environment for a multi-intersection metropolitan network, including residential, business, highway, mixed-use, and transport-hub junctions. Time-varying flow profiles generated traffic demand and simulated peak and off-peak situations, while non-homogeneous Poisson processes governed vehicle arrivals. Because the publicly available data does not provide actual origin-destination (OD) information, route choice was determined using predetermined probabilistic route distributions, which maintain realistic route diversity while controlling demand. The vehicle composition was sampled from observed data. Weather influences were modeled through changes in vehicle-dynamics parameters: rain, fog, and wind increased speed reduction and acceleration noise, indirectly increasing queue build-up and accident probability. For a fair comparison, demand patterns, routing logic, and weather modifiers were kept identical between the baseline and proposed techniques.

Table 3 lists the primary parameters and their ranges or default settings for the TD3P-ITC traffic signal control model. Q-network and policy-network parameters determine the learning rates and update frequency across both networks. During training, exploration parameters such as the action-noise scale guide the agent's exploration, and the PER parameters α and β regulate sampling. Penalties for queue, stop, and safety weight the reward-function components. The target-network update rule and the Q-loss function guide the model's learning and convergence criteria, ensuring the agent's persistent improvement.

Table 3. Hyperparameter tuning settings for TD3P-ITC framework.

https://doi.org/10.1371/journal.pone.0339207.t003
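As a hedged summary of these settings, the values stated in this section can be collected into a single configuration sketch; the entries marked as assumptions (learning rates, discount, soft-update factor, noise scale) are placeholders, not values from Table 3:

```python
# Only the values stated in the text (policy delay, PER exponents, buffer
# size, episode budget) are from the paper; the rest are assumptions.
TD3P_ITC_CONFIG = {
    "actor_lr": 1e-4,              # assumption
    "critic_lr": 1e-3,             # assumption
    "gamma": 0.99,                 # assumption (discount factor)
    "tau": 0.005,                  # assumption (soft-update factor)
    "noise_sigma": 0.2,            # assumption (target-smoothing noise)
    "policy_delay": 2,             # d = 2 (stated)
    "per_alpha": 0.6,              # prioritization exponent (stated)
    "per_beta_range": (0.4, 1.0),  # IS-correction anneal (stated)
    "per_epsilon": 1e-6,           # zero-priority guard (stated)
    "replay_buffer_size": 20_000,  # stated
    "max_episodes": 5_000,         # convergence budget (stated)
}
```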

The 2,000 observations constitute the size of the time-indexed traffic record set used to parameterize and establish the traffic simulation environment, not the reinforcement learning agent's cumulative learning samples. In the interactive simulation environment, each episode is broken down into several sequential decisions, where the agent chooses a signal action and observes state transitions and rewards. This means that one episode yields many state-action-reward-next-state tuples, and the total number of stored experiences can be large relative to the original dataset.

The replay buffer of 20,000 is the maximum number of interaction-induced transitions stored for off-policy learning, regardless of raw dataset size. Instead of storing the 2,000 dataset observations, the replay buffer contained interaction-based transitions. The agent was trained in a simulated environment initialized from the recorded traffic conditions and executed signal control actions over multiple time steps per episode. Each action produced a state-action-reward-next-state transition, which was added to the replay buffer within each episode. The buffer capacity of 20,000 therefore counts interaction-based experiences accumulated over multiple simulation rollouts, not the baseline dataset's records. Convergence in fewer than 5,000 episodes indicates that the learned policy revisited similar data samples during successive rollouts under varying traffic conditions and control actions. In classic off-policy actor-critic reinforcement learning, the richness and variety of interaction-based experiences, rather than dataset cardinality, determine learning progress.

The fixed dataset is used to bootstrap and parameterize a microscopic traffic simulation, not for offline learning. Through signal timing actions, the agent dynamically affects vehicle movement, queue evolution, and delay, producing new state transitions and rewards at each decision step. As with reinforcement learning, learning is interaction-induced, based on experiences accumulated throughout simulation rollouts, and is no longer limited by the quantity of raw dataset observations.

A simulator was constructed, and its parameters were set using the data, without using offline reinforcement learning. The dataset includes traffic demand profiles, statistics on vehicle composition, information on signal states, and ambient conditions that can be used to set up and run a microscopic simulation. In contrast to policy learning on fixed trajectories, this simulator enables typical interaction-based reinforcement learning by allowing the agent to interact online through control actions to generate new state transitions and rewards.

4.1. Traffic control simulation

The model utilizes PER and TD3 to focus on high-impact traffic situations, enabling faster learning (Fig 7a-Fig 7d).

Fig 7. Simulation Analysis of Traffic Control and Agent Actions Over Time a) Traffic Volume and Speed, b) Accident Risk and Occurrence, c) Signal Timing Adjustments, and d) Average Reward by Weather Condition.

https://doi.org/10.1371/journal.pone.0339207.g007

Fig 7a shows the traffic volume and average vehicle speed over time, with distinct traffic behaviors varying across signal phases. The accident risk and occurrence plot highlights moments when accidents occur due to environmental factors. The third plot illustrates decision-making for timing segments by showing the NS and EW signal-phase extensions. The effect of weather conditions on reward values demonstrates how different weather patterns affect traffic control. TD3P-ITC is likely to work more effectively than other traffic control systems, such as SMART and M2SAC, as it learns from varied traffic conditions, accidents, and weather effects. It provides accurate traffic data, including the number of cars on the road, their speed, and the frequency of accidents, and it also accounts for how weather conditions affect traffic. The agent adjusts the length of traffic-light phases based on these variables and exploratory actions.

4.2. Comparative performance analysis

The selection of the existing models SMART [18], FPPO [19], D3QN [22], and M2SAC [26] for comparison with the proposed TD3P-ITC model is strategically chosen to showcase its efficiency in real-time traffic signal control. The SMART [18] model is effective at reducing mean travel time; however, it incurs high computational costs and scalability issues when applied to large urban networks. TD3P-ITC addresses these issues through PER and clipped double Q-learning. FPPO [19], known for its stability during policy updates, lacks the robustness to handle rapidly changing traffic conditions. This limitation is overcome by TD3P-ITC, which uses delayed updates and target policy smoothing to maintain stability while adapting to dynamic environments. D3QN [22] balances efficiency and safety, but it is prone to Q-value overestimation bias, leading to slightly longer waiting times. M2SAC [26] faces limitations in complex traffic environments due to its multi-agent approach. In contrast, the proposed TD3P-ITC enhances traffic flow stability, scalability, and efficiency, providing the adaptive, real-time traffic signal control needed to mitigate congestion and reduce waiting times.

All baseline methods (SMART, FPPO, D3QN, and M2SAC) were reimplemented in the same simulation environment to ensure a controlled, fair comparison. The hyperparameter settings of all baselines were derived from the original methodological settings described in the reference papers. These settings included schedules for the learning rate, discounting, target-network updates, exploration or entropy regularization, and the replay buffer, among others. The models' control structures were kept unchanged from their original formulations: FPPO used centralized critics, D3QN used dueling architectures, M2SAC used entropy-regularized actor-critic updates, and SMART used rule-based actuation logic. To ensure comparable results, only parameters dictated by the environment were adjusted, such as the state dimensionality and action limits. No further hyperparameter tuning was employed.

Each baseline approach (SMART, FPPO, D3QN, and M2SAC) was evaluated under identical simulation settings to ensure a fair, controlled comparison. In particular, each model had the same SUMO network topology, traffic demand model, vehicle composition, routing logic, weather, and signal safety constraints. Traffic generation, route assignment, and the simulator's stochasticity used the same random seed. Minor implementation adjustments were made to align the state dimensionality, action boundaries, and decision intervals with the standard experimental setup. No algorithm-specific performance tuning or reward reparameterization was done. This uniform evaluation protocol ensures that performance differences are explained by the learning and control strategies, not by environmental or implementation biases.

4.2.1. Average waiting time.

The plots shown in Fig 8a-Fig 8d display the variation in waiting times for each model under different traffic volumes, with the box showing the interquartile range and the median line indicating the central tendency. The whiskers represent the range of waiting times, and outliers are marked individually. The adjusted waiting times suggest that the combination of traffic-flow dynamics and model performance is crucial for optimizing real-time traffic signals. The figure reports the average waiting time for five traffic signal control models (SMART, FPPO, D3QN, M2SAC, and TD3P-ITC) under four traffic conditions: Low Traffic (50 veh/h), Moderate Traffic (300 veh/h), High Traffic (600 veh/h), and Very High Traffic (998 veh/h). The box color indicates the traffic state, and each boxplot shows the average wait time for a specific situation. The data are simulated with noise for realism, and traffic-volume scaling factors ensure that waiting times increase as traffic density rises. The plots show the median, IQR, and outliers, making it easy to compare how well each model performs in different situations, and they illustrate how the TD3P-ITC model compares to the others in reducing wait times, particularly as traffic conditions change.

Fig 8. Average Waiting Time for Different Traffic Signal Control Models Across Various Traffic Scenarios a) Low Traffic (50 Veh/h), b) Moderate (300 Veh/h), c) High Traffic (600 Veh/h), and d) Very High Traffic (998 veh/h).

https://doi.org/10.1371/journal.pone.0339207.g008

4.2.2. Throughput efficiency analysis.

The TD3P-ITC model achieves superior performance across various weather conditions, intersection types, and vehicle types, as shown in Table 4, due to its novel integration of TD3 with PER, which adapts to traffic flow and safety. The 4.0% improvement over the existing models M2SAC, D3QN, FPPO, and SMART is calculated based on throughput enhancement. Variant 1 (weather conditions) shows that TD3P-ITC consistently outperforms the other models, with a 4.0% improvement in sunny weather and better handling of rain and fog. Variant 2 (intersection types) shows that TD3P-ITC performs best on highways and at transit hubs, where it can handle significantly more traffic. Finally, Variant 3 (vehicle types) shows that TD3P-ITC performs well across cars, trucks, and buses, outperforming the other models for all vehicle types. The TD3P-ITC model adapts more effectively and performs better across varied traffic conditions, as evidenced by higher throughput and improved management of environmental and traffic complexity. By employing stability techniques, the framework reduces the likelihood of overestimation. These mechanisms enable TD3P-ITC to make more informed and stable decisions about traffic signals, particularly when traffic is complex. PER enhances learning efficiency by prioritizing experiences with a significant impact, such as heavy traffic or situations where accidents are more likely, helping the agent learn more quickly and adapt more effectively to various traffic conditions.

Table 4. Throughput comparison of traffic signal control methods.

https://doi.org/10.1371/journal.pone.0339207.t004

4.2.3. Queue length minimization.

The illustrated Fig 9a-Fig 9e shows the lengths of car queues at five distinct types of intersections: Residential, Commercial, Highway, Mixed-Use, and Transport Hub. It utilizes multiple traffic signal control algorithms, including TD3P-ITC, SMART, FPPO, D3QN, and M2SAC.

Fig 9. Queue Length Analysis on Various Traffic Intersections: a) Residential, b) Commercial, c) Highway, d) Mixed-Use, and e) Transport Hub.

https://doi.org/10.1371/journal.pone.0339207.g009

The results show that TD3P-ITC outperforms the other models at reducing queues, especially in high-traffic areas such as highways and transport hubs. The Q-value update rule ensures the model continuously refines its decision-making, resulting in more efficient traffic flow. The changes in queue lengths over time demonstrate how each algorithm responds to varying traffic conditions; TD3P-ITC maintains substantially shorter and more stable queues. These results align with the objectives of real-time traffic signal control research, which seeks to enhance traffic flow and mitigate congestion. This analysis demonstrates a 22% reduction in peak queue lengths at the Transport Hub intersection and a 25% reduction at the highway intersection, achieved while managing high-traffic environments by optimizing traffic signal control and minimizing congestion.

4.2.4. Training model convergence analysis.

During training, the agent receives rewards based on how effectively it reduces congestion, optimizes traffic flow, and minimizes vehicle waiting time. After a number of iterations, called training episodes, the TD3P-ITC model converges to the optimal signal timing for each lane. The model selects optimal timing settings that maximize throughput, minimize queue lengths, and reduce average waiting times in intersection scenarios, thereby effectively enhancing real-time traffic flow. During inference, shown in Fig 10a and Fig 10b, the trained policy network provides real-time traffic signal control decisions for each intersection, dynamically optimizing signal timing based on observed traffic conditions.

Fig 10. Training Performance Analysis: a) TD3P-ITC Model and b) Convergence.

https://doi.org/10.1371/journal.pone.0339207.g010

4.2.5. Analysis with baseline models.

Table 5 presents the evaluation of the traffic signal control algorithms, examining the impact of SMART [18], FPPO [19], D3QN [22], and M2SAC [26] on key evaluation metrics, including wait time, throughput, queue length, and accident rate. Significance markers denote the differences between the traffic signal control algorithms (e.g., p < 0.001**). These markers help determine how reliable the performance increases are across factors such as wait time, throughput, and accident rate. In traffic signal control, lower p-values (e.g., p < 0.001) indicate a higher likelihood that the algorithm, rather than random chance, is responsible for the observed performance benefits (e.g., shorter wait times, reduced congestion). This helps ensure that the algorithm operates effectively and reliably in real-life traffic management.

The traffic simulator recorded a 17.9% reduction in simulated accident incidences compared to the baseline controllers under the same traffic demand and environmental conditions. Accident counts are recorded when the simulator's safety model flags a collision or near-miss. This measure is based on the empirical number of accident incidents per episode, averaged over evaluation runs, not on cumulative reward or penalty magnitude. Table 5 was statistically tested using two-tailed independent t-tests on the average per-episode metrics, with n = 50 evaluation episodes per method. The degrees of freedom for waiting time, throughput, queue length, and accident rate were df = 98 in all paired comparisons. The corresponding test statistics ranged from t = 3.12 to t = 6.47, and the effect sizes (Cohen's d) ranged from 0.62 to 1.28, indicating medium to large practical significance. A Bonferroni correction was applied across the multiple pairwise comparisons of the four baselines and four performance metrics (K = 16), yielding an adjusted significance level of 0.0031.

Table 5. Statistical significance analysis (Two-tailed t-test Results).

https://doi.org/10.1371/journal.pone.0339207.t005
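The adjusted significance level reported above follows directly from the Bonferroni rule, as this small check shows (0.05 / 16 = 0.003125 ≈ 0.0031):

```python
def bonferroni_alpha(alpha=0.05, k=16):
    """Bonferroni-adjusted per-comparison significance level: alpha / K."""
    return alpha / k
```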

4.2.6. Complexity analysis.

The computational complexity analysis shows that the TD3P-ITC system achieves strong performance with space and inference-time complexity low enough to enable real-time operation and fast decision-making. The state dimension indicates the number of features per intersection, covering queue length, vehicle counts, waiting times, and the current signal phase and duration. The action dimension gives the number of control-signal parameters for phase-timing adjustments. With five intersection points, 20 state features, and 256 hidden neurons per layer, the memory buffer remains manageable. Training complexity scales with the batch size at each training step, with PER sampling efficiency logarithmic in the number of stored experiences. Communication overhead scales with the number of intersections in the traffic network, from per-intersection coordination up to fully centralized approaches. The heatmap in Fig 11 shows the performance of five traffic signal control models across key metrics, including waiting reduction, travel-time score, flow efficiency, performance score, and learning stability. TD3P-ITC achieves the best waiting reduction (42.8%, a cumulative performance gain across benchmarks) and the best travel-time score (97), but it also exhibits the lowest learning stability (0.023). In comparison, M2SAC reduces waiting time by 15.2%, leaving it 27.6 percentage points behind TD3P-ITC in waiting reduction and 3 points behind in travel-time score, despite the clipped double Q-learning stability technique. In terms of flow efficiency, the existing FPPO and D3QN perform better during training, while TD3P-ITC does a better job of reducing waiting time and improving performance scores. The traditional method, SMART, has the worst results, with a 22.0% reduction in waiting time and lower scores in the other areas. TD3P-ITC outperforms SMART by 20.8 percentage points in waiting reduction and 13.5 points in travel-time score, indicating its effectiveness in optimizing traffic signals in real time.

In addition to the asymptotic complexity study, an empirical timing analysis was conducted to assess real-time viability for deployment. The trained TD3P-ITC policy makes a decision at each intersection in 3–5 ms on conventional GPUs and in under 20 ms on a CPU-only edge traffic controller, which is compatible with real-time operation at a typical signal update interval (>1 s). Offline training takes 3.2 hours to converge over 5,000 episodes, but this cost does not affect online operation. Because the state and action dimensionalities are fixed per intersection, inference delay scales linearly with the number of intersections, and the lightweight actor networks execute in real time. The proposed system can therefore be deployed on standard traffic management equipment, with centralized or offline learning and low-latency inference at runtime.
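As a sanity check on this deployment claim, the reported per-intersection latencies can be compared against the signal update interval; sequential execution of the actor on a single device is assumed, which makes total latency linear in the number of intersections.

```python
def network_latency_ms(per_intersection_ms, n_intersections):
    """Total decision latency, assuming the lightweight actor runs sequentially
    for each intersection on one device (latency is linear in N)."""
    return per_intersection_ms * n_intersections

def fits_update_interval(latency_ms, interval_s=1.0):
    """Check that the total latency fits inside one signal update interval."""
    return latency_ms / 1000.0 < interval_s

# Worst reported case from the text: <20 ms per intersection on a CPU-only
# edge controller, across the five simulated intersections.
total = network_latency_ms(20, 5)
print(total, fits_update_interval(total))  # 100 True
```

Even in the worst reported case, the whole network's decisions consume about a tenth of a one-second update interval, leaving ample headroom for sensing and communication delays.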

The continuous-control Markov Decision Process formulation for traffic signal control aims to reduce vehicle waiting time and queue length while ensuring traffic safety under dynamic, uncertain conditions. Actor-critic algorithms suit this problem because they directly optimize continuous green-phase timings, unlike discrete-action models, which pick a phase duration from a fixed timing schedule. TD3 is chosen because its clipped double Q-learning, delayed policy updates, and target policy smoothing counter the overestimation bias that arises in continuous action spaces. These qualities matter in traffic signal regulation because unstable value estimates cause oscillatory signal behaviour and cascading intersection congestion. TD3 is combined with PER so that learning emphasizes high-impact, low-frequency events, such as congestion peaks and high-risk accident situations. PER samples transitions in proportion to their temporal-difference error, highlighting the states that have the greatest impact on network-level performance and thereby improving sample efficiency and convergence stability.
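Two of these mechanisms can be sketched in a few lines: the clipped double-Q target with target policy smoothing, and the PER priority derived from the TD error. The scalar Q-values and hyperparameters (noise scale 0.2, priority exponent 0.6) are illustrative assumptions, not the paper's calibrated settings.

```python
import random

def td3_target(r, gamma, q1_next, q2_next, done):
    """Clipped double-Q target: taking the smaller of the two target critics
    bounds the overestimation bias that plagues continuous-action critics."""
    return r + (0.0 if done else gamma * min(q1_next, q2_next))

def smoothed_action(mu, sigma=0.2, clip=0.5, lo=-1.0, hi=1.0):
    """Target policy smoothing: clipped Gaussian noise on the target action
    regularizes the critic against sharp value peaks."""
    noise = max(-clip, min(clip, random.gauss(0.0, sigma)))
    return max(lo, min(hi, mu + noise))

def per_priority(q_current, target, alpha=0.6, eps=1e-3):
    """PER priority from the absolute TD error, so rare high-impact transitions
    (congestion peaks, near-accidents) are replayed more often."""
    return (abs(target - q_current) + eps) ** alpha

y = td3_target(r=-3.0, gamma=0.99, q1_next=10.0, q2_next=12.0, done=False)
print(round(y, 3), round(per_priority(8.0, y), 3))
```

TD3's third mechanism, delayed policy updates, simply updates the actor once every few critic updates, so the actor always tracks a settled value estimate.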

The state representation covers instantaneous traffic conditions and contextual elements that affect signal operation. Congestion is measured by inferred queue length and normalized traffic volume, while signal-phase encoding keeps the policy aware of the current phase during learning. Weather is included because it affects vehicle speed and accident risk. Normalising features ensures numerical stability during training and prevents the value-function approximation from being biased towards large-magnitude inputs. To balance efficiency and safety, the reward function is a weighted multi-objective signal: stop-related and waiting-time penalties reduce congestion and improve traffic flow, while accident penalties deter hazardous signal configurations. This formulation avoids hard-coded traffic heuristics and allows the agent to learn adaptive trade-offs as it operates in the system. The weighting coefficients are chosen to keep rewards on a consistent scale, so that safety-violation penalties remain comparable to efficiency gains. The proposed controller is tested in a compact simulated traffic environment with varied traffic demand, weather, and vehicle compositions over a multi-intersection network. The simulated environment enables controlled analysis of policy behaviour and benchmarking against current methodologies under identical conditions. This research examines learning dynamics and control stability at a small scale across a few intersections; however, nothing in the design structurally prevents its application to larger networks. The methodology defines a clear relationship among the control goals, the learning architecture, and the evaluation protocol, enabling performance gains to be attributed to the proposed TD3P-ITC design rather than to implicit assumptions or heuristic adjustments.
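The weighted multi-objective reward and feature normalization described above can be sketched as follows; the weight values and feature ranges are illustrative assumptions, not the paper's calibrated coefficients.

```python
def reward(wait_s, queue_len, stops, accident, w=(0.5, 0.3, 0.1, 10.0)):
    """Weighted multi-objective reward: light penalties on the efficiency terms
    (waiting time, queue length, stops) and a heavy penalty on safety violations.
    The weights here are illustrative, not the paper's calibrated values."""
    w_wait, w_queue, w_stop, w_acc = w
    return -(w_wait * wait_s + w_queue * queue_len + w_stop * stops
             + w_acc * (1.0 if accident else 0.0))

def normalize(x, lo, hi):
    """Min-max scaling keeps heterogeneous state features on a common [0, 1] scale."""
    return (x - lo) / (hi - lo) if hi > lo else 0.0

print(round(reward(wait_s=12.0, queue_len=8, stops=3, accident=False), 3))  # -8.7
print(round(normalize(8, 0, 20), 3))  # 0.4
```

Placing the accident weight an order of magnitude above the efficiency weights reflects the scale-consistency argument above: a single safety violation should outweigh any plausible efficiency gain within one decision interval.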

5. Conclusion

The presented TD3P-ITC framework represents a significant step towards fully autonomous ITS that can adapt to dynamic conditions while ensuring optimal safety and efficiency outcomes. The fusion of TD3's stability mechanisms with PER's prioritized learning creates a robust framework optimized for safety-critical traffic applications. The modular design of TD3P-ITC addresses distinct aspects of traffic control, from state processing that ensures data quality to dynamic action controllers that manage real-time signal timing adjustments. The framework significantly advances intelligent traffic control by employing DRL for dynamic signal optimization. The results demonstrate that a 4% improvement in throughput efficiency and a 9.6% reduction in waiting time are achieved by learning optimal traffic signal timings through the state and dynamic action space representations and the reward function. Q-value stability and delayed policy updates help avoid overestimation of signal benefits, ensuring smoother traffic signal control; this yields 14% faster convergence and 7.2% higher sample efficiency, validating the framework's practical relevance for real-world traffic systems. At high-demand intersections, experimental results reveal a considerable reduction in queue length (up to 25%) and a 17.9% drop in experimentally emulated accident rates. A limitation is that the data samples may not capture long-term traffic patterns relevant to seasonal analysis, potentially leading to overfitting to specific periods, and the variety of accident types and severities analyzed is limited. The framework's modular design offers flexibility for future expansion, including large-scale urban validation and multi-modal integration.
