Abstract
Deep reinforcement learning has achieved significant success in complex decision-making tasks. However, the high computational cost of policies based on deep neural networks restricts their practical application. Specifically, each decision made by an agent requires a complete neural network computation, leading to a linear increase in computational cost with the number of interactions and agents. Inspired by human decision-making patterns, which involve reasoning only about critical states in continuous decision-making tasks rather than all states, we introduce the LazyAct algorithm, which significantly reduces the number of inferences while preserving policy quality. First, we incorporate a state skipping branch into the actor network to bypass states with minimal impact. Next, we establish optimization objectives with cost constraints for single-agent and multi-agent inference, based on the IMPALA and MAPPO frameworks, respectively. Finally, we use pre-training and fine-tuning techniques to train the policy network. Extensive experimental results indicate that LazyAct reduces the number of inferences by approximately 80% and 40% in single-agent and multi-agent scenarios, respectively, while sustaining comparable policy performance. This reduction in inferences significantly decreases the time and FLOPs required by LazyAct to complete tasks. Code is available at https://www.dropbox.com/scl/fo/wyoqo6q9gyt86zobfgbvx/h?rlkey=0moyxsnoiisfs9y4h89hsou1l&dl=0.
Citation: Zhang H, Chen Z, Deng H, Feng C (2025) LazyAct: Lazy actor with dynamic state skip based on constrained MDP. PLoS ONE 20(2): e0318778. https://doi.org/10.1371/journal.pone.0318778
Editor: Manoharan Premkumar, Dayananda Sagar College of Engineering, INDIA
Received: June 28, 2024; Accepted: January 21, 2025; Published: February 6, 2025
Copyright: © 2025 Zhang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the manuscript and its Supporting information files.
Funding: This work was supported by the Sichuan Science and Technology Program (No. 2022NSFSC0552). The funders played a role in the study by providing support for the decision to publish.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Deep reinforcement learning (DRL) has achieved remarkable success in complex sequential decision-making tasks by integrating neural networks with reinforcement learning techniques. This success spans various domains, including Go [1], autonomous driving [2], and large language models [3, 4]. In DRL, agents perceive environmental states and predict optimal actions iteratively to maximize the cumulative reward associated with a given task. Consequently, the computational cost of neural networks scales linearly with the number of interactions between agents and their environment. In multi-agent scenarios, each agent must compute policy decisions for every state, leading to a computational cost that increases proportionally with the number of agents. Reducing the overall inference cost while preserving policy quality is a crucial challenge for the practical deployment of DRL.
To address the issue of high inference costs, researchers have devised several methods to accelerate prediction processes. These methods can be broadly categorized into two main groups. The first group encompasses neural network compression techniques, including neural network pruning [5–7], weight quantization [8–10], and knowledge distillation [11–13]. The second group involves dynamic neural networks [14], which allocate computing resources adaptively according to the complexity of the data. Examples of such networks include MSDNet [15], which dynamically selects computation branches based on data complexity; GFNet [16], which adjusts image resolution based on local importance; and S2DNAS [17], which automatically searches for optimal dynamic network structures. Dynamic neural networks can be accelerated even further by integrating these compression techniques. However, it is crucial to note that these solutions still require computation for each sample, and there remains potential for reducing prediction costs in tasks that involve sequential decision-making.
Drawing inspiration from human decision-making, agents can reduce inference costs by not making predictions at every state. Consider autonomous driving and SMAC tasks as examples. An autonomous driving agent can repeatedly execute the same action on road segments with no vehicles without risking collisions. Similarly, the SMAC task presents a scenario of multi-agent cooperative attack, in which a specific marine can repeatedly perform the same action without impacting the collective combat. This repetition of actions is analogous to skipping intermediate states. To leverage this characteristic, we introduce the LazyAct algorithm in this work, which augments the original policy network with a state skip branch. This branch allows repeated actions to be executed in trivial states, bypassing policy computations for these states. This paper draws on the concept of options and Semi-Markov Decision Processes (semi-MDPs), where a sequence of actions over a duration is treated as a high-level action, known as an option [18, 19]. The LazyAct proposed here is a simplified variant of options, where an option is represented as repeating the first action k times. Moreover, instead of predicting option termination at each state using a termination function, we predict the duration of the option when the action is executed. This reduces the prediction cost of the neural network and simplifies the optimization objective for constrained options. For a single agent, our goal is to maximize the cumulative reward of the task while constraining the number of policy computations needed for completion. We formulate the task as a constrained Markov Decision Process (MDP) problem to derive the optimization objective and proceed with policy learning accordingly. Specifically, we adapt the constrained MDP to the IMPALA [20] algorithm to handle policy learning in single-agent tasks. In multi-agent systems, we constrain the number of decisions in each state while maximizing the cumulative reward of the task. In particular, we extend the constrained MDP derivation to the MAPPO [21] algorithm, addressing policy learning for multi-agent tasks. Furthermore, we enhance learning efficiency and policy quality through a combination of unconstrained policy pre-training and constrained policy fine-tuning.
In this work, we make the following contributions:
- We design the LazyAct algorithm for single-agent tasks, building upon the IMPALA framework. It maximizes cumulative rewards while skipping unimportant states.
- For multi-agent scenarios, we design the LazyAct algorithm on top of the MAPPO framework to maximize the joint reward while respecting constraints on the number of agent decisions per state. To further optimize the training process, we incorporate pre-training and fine-tuning strategies to enhance overall efficiency.
- Extensive experiments have been conducted on both single-agent and multi-agent tasks. The findings show that LazyAct achieves computational savings of 80% in single-agent scenarios and 40% in multi-agent scenarios while preserving comparable policy performance.
Related work
Neural network acceleration
The objective of neural network pruning is to eliminate redundant connections between neurons in the network. Structured pruning can effectively reduce computational complexity, and researchers have proposed various pruning algorithms targeting convolutional neural networks, including the removal of channels or convolutional kernels [5–7]. Neural network parameter quantization techniques aim to limit the representation precision of weights. For instance, compressing 32-bit floating-point parameters into 16-bit halves the overall network size and roughly doubles computational speed on GPUs. Global quantization selects the minimum numerical precision that maintains accuracy to replace the original values, and further improves accuracy through retraining [22, 23]. DeepCompression [24] employs clustering to divide weights into several clusters and discretizes the weights based on the cluster centers, markedly reducing the storage space for weights. The motivation behind local quantization is that different layers should adopt parameters with different precisions, such as maintaining high precision for weights closest to the input layer and low precision for weights close to the output layer [25]. Lee et al. proposed an element-wise gradient scaling training algorithm to address the accuracy degradation caused by the direct use of the straight-through estimator (STE) in parameter quantization, aiming to enhance prediction accuracy while maintaining low-precision parameters [26]. Peters et al. introduced QBitOpt, which can generate mixed-precision networks with high task performance under strict resource constraints, outperforming fixed-precision methods [27].
In addition to the aforementioned static compression algorithms, dynamic neural networks have emerged in recent years, which aim to reduce inference costs by assigning different computational paths to different samples. Huang et al. proposed the efficient MSDNet, an image classification model that judges sample difficulty and selects features from different layers for prediction: when the predicted value of a certain category exceeds a threshold, it is output; otherwise, processing proceeds to subsequent features. MSDNet significantly reduces the average prediction cost while maintaining prediction accuracy, and a multi-scale network structure was designed to address the difficulty of classifying from low-level features [15]. Wang et al. discovered that specific regions in images are more critical for recognition tasks. Based on this finding, they proposed the GFNet model, which utilizes reinforcement learning to focus on important recognition regions for prediction and determines dynamic prediction thresholds based on sample labels. Owing to the smaller prediction region, it significantly reduces prediction costs while aligning more closely with human recognition behavior [16]. Cheng et al. leveraged neural architecture search techniques to automatically construct a sample-aware dynamic neural network, which exhibits higher inference efficiency and prediction accuracy than manually designed dynamic models [28]. Cui et al. introduced Brainstorm, a deep learning framework for optimizing dynamic neural networks. By unifying the expression of dynamism and applying its proposed dynamic optimizations, Brainstorm can increase the speed of popular dynamic neural networks by up to 11 times [29].
The accelerated-inference idea of this work is similar to that of dynamic neural networks. However, most of the aforementioned dynamic neural networks target image recognition and reduce inference costs by controlling resolution, recognition regions, and so on, so they cannot cope with non-image scenarios. Drawing on the idea of dynamic inference, this paper performs inference only on critical states to reduce the total cost.
Action repeat technology
There exists a wide range of research within RL that addresses the challenge of determining the optimal control frequency, as well as the subject of action repetition. The motivations behind repeating actions vary, such as exploration, improving the signal-to-noise ratio, and managing sample complexity. Despite these diverse reasons, the methodologies employed to tackle these issues share remarkable similarities. These approaches often balance trade-offs between computational efficiency, accuracy and robustness, aiming to optimize performance by carefully selecting when and how often to repeat certain actions. Braylan et al. demonstrate that establishing an appropriate frame skip can be pivotal for the performance of agents trained to play Atari 2600 games [30]. Across all six games, the frame skip proved to be a significant factor in achieving success. Notably, for two of these games, a substantial frame skip resulted in the best performance. Khan et al. examine how the frame skipping rate affects the agent’s learning process and ultimate performance, specifically exploring its impact through deep Q-learning, experience replay memory, and the ViZDoom Game AI research platform [31]. Metelli et al. introduce the concept of action persistence, which involves repeating an action for a set number of decision steps, thereby altering the control frequency [32]. They first examine how action persistence impacts the efficacy of the optimal policy, and then introduce a new algorithm called PFQI, an extension of FQI, aimed at learning the optimal value function for a specific level of persistence. Sharma et al. introduce a unique framework, Fine Grained Action Repetition (FiGAR), that grants the agent the capability to choose both the action and the frequency of its repetition [33]. By facilitating temporal abstractions in the action space, FiGAR can enhance any deep reinforcement learning algorithm that relies on an explicit policy estimate. Biedenkapp et al. present a proactive approach where the agent not only picks an action in a given state but also determines the duration of commitment to that action [34]. The method incorporates skip connections between states and learns a skip-policy to repeat the same action across these skips. Sabbioni et al. devised the All-Persistence Bellman Operator, which enables efficient use of both low-persistence experiences through sub-transition decomposition and high-persistence experiences through an appropriate bootstrap procedure [35].
The action repetition methods mentioned above mainly focus on single-agent settings and usually aim to maximize cumulative rewards without constraints. In this paper, the purpose of action repetition is to reduce inference cost while preserving policy quality, so we model it as a constrained RL problem. Furthermore, we propose a complete solution for both single-agent and multi-agent scenarios.
Methodology
Overview
Fig 1 illustrates the framework of LazyAct. Each policy network comprises a state skip branch and an action branch, wherein the action output is based on both the skip decision and the current state.
Initially, the agent determines the skip length by analyzing the present state. The policy network then uses both the skip length and the current state to identify the appropriate action. This sequence of actions is sent to the environment for execution. Furthermore, the value network integrates a cost value branch and a state value branch: the former evaluates the computational expense associated with the current state, whereas the latter estimates the accumulated reward. The actor determines and stores the corresponding actions and skip durations for the multiple agents in the environment in the Action Queue. Specifically, when an agent’s action queue is depleted, the environment forwards the most recent observation to the agent, which computes the upcoming action list. The states, observations and rewards are subsequently used to refine the policy network, following the constrained MDP framework.
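To make the architecture concrete, below is a minimal PyTorch sketch of an actor with skip and action branches and a critic with value and cost heads, consistent with the description above; the layer sizes, the linear feature extractor, and the class names (SkipActorNet, CostCriticNet) are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SkipActorNet(nn.Module):
    """Actor with a skip branch and an action branch conditioned on the skip decision."""

    def __init__(self, obs_dim, n_actions, max_skip, hidden=128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # Skip branch: a categorical distribution over skip lengths 0..max_skip.
        self.skip_head = nn.Linear(hidden, max_skip + 1)
        # Action branch: conditioned on the state features and the sampled skip.
        self.action_head = nn.Linear(hidden + max_skip + 1, n_actions)

    def forward(self, obs):
        h = self.feature(obs)
        skip_logits = self.skip_head(h)
        k = torch.distributions.Categorical(logits=skip_logits).sample()
        k_onehot = F.one_hot(k, num_classes=skip_logits.shape[-1]).float()
        action_logits = self.action_head(torch.cat([h, k_onehot], dim=-1))
        a = torch.distributions.Categorical(logits=action_logits).sample()
        return k, a  # action a is repeated while the next k states are skipped


class CostCriticNet(nn.Module):
    """Critic with a state-value head and a cost-value head, as in Fig 1."""

    def __init__(self, obs_dim, hidden=128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value_head = nn.Linear(hidden, 1)  # estimates cumulative reward V(s)
        self.cost_head = nn.Linear(hidden, 1)   # estimates cumulative skip cost C(s)

    def forward(self, obs):
        h = self.feature(obs)
        return self.value_head(h), self.cost_head(h)
```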
LazyAct for single-agent tasks based on IMPALA
For single-agent tasks, we devise the LazyAct algorithm on top of the IMPALA framework. IMPALA (Importance Weighted Actor-Learner Architecture) is a distributed deep reinforcement learning training framework distinguished by its separation of Actor and Learner roles. Actors interact with their respective environments across multiple servers, relaying experience data to the central Learner. The Learner updates the policy network parameters and disseminates the latest parameters back to the Actor servers. Notably, IMPALA introduces the V-trace algorithm, which addresses off-policy issues and enhances the efficiency of policy learning.
Formal problem definition.
We formulate LazyAct as a constrained policy optimization problem, as expressed in Eq (1). Here, τ denotes the interaction trajectory, and γ, belonging to the interval [0, 1], is the discount factor. Rt signifies the reward at time step t. Specifically, the indicator 1(kt > 0) indicates whether the state st is skipped, with kt denoting the skip decision output; if kt > 0, state st is skipped. ϵ and T represent the skip constraint and the episode length, respectively. The constraint ensures that the average skip ratio across completed tasks is greater than or equal to the predefined value ϵ.
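Eq (1) itself is not reproduced here; the following LaTeX sketch writes the constrained objective in a form consistent with the description above (the symbol T for episode length and the indicator notation are assumptions):

```latex
\max_{\theta}\; \mathbb{E}_{\tau\sim\pi_{\theta}}\Big[\sum_{t=0}^{T}\gamma^{t}R_{t}\Big]
\quad \text{s.t.}\quad
\mathbb{E}_{\tau\sim\pi_{\theta}}\Big[\frac{1}{T}\sum_{t=0}^{T}\mathbb{1}(k_{t}>0)\Big] \;\ge\; \epsilon
```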
Utilizing Lagrange multipliers, we convert the constrained optimization problem presented in Eq (1) into an unconstrained one, as depicted in Eq (2). Here, θ and α denote the neural network parameters and the Lagrange multiplier, respectively. θ* and α* correspond to the optimal solutions to the transformed problem.
To simplify the problem, we scale the constraints in Eq (1) with a discount factor γ, as shown in Eq (3).
We formulate the optimization objective, as detailed in Eq (4), and subsequently divide it into policy and multiplier optimization components. Specifically, the policy and multiplier optimizations are conducted iteratively. In the policy optimization phase, we augment the original reward with skip rewards to incentivize the skipping of certain states. The multiplier optimizer is employed to ensure constraint satisfaction. When the skip ratio meets the ϵ constraint, α is set to 0; otherwise, α tends towards positive infinity.
This is equivalent to imposing a penalty for not skipping: since 1(kt > 0) = 1 − 1(kt = 0), the skip-reward term can be substituted by a penalty term. We use Ct to represent the penalty term, that is, Ct = 0 if kt > 0, otherwise Ct > 0. The policy target is defined as Eq (5).
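As a sketch of the penalty-augmented target described here (the exact form of Eq (5) is not reproduced), the objective can be written as:

```latex
\max_{\theta}\; \mathbb{E}_{\tau\sim\pi_{\theta}}\Big[\sum_{t=0}^{T}\gamma^{t}\big(R_{t}-\alpha\,C_{t}\big)\Big],
\qquad
C_{t}=0 \ \text{if}\ k_{t}>0,\quad C_{t}>0 \ \text{otherwise.}
```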
LazyAct with V-trace.
We extend the policy target in Eq (4) to a V-trace version; V-trace is the core of IMPALA. The original V-trace state value function is defined in Eq (6).
The V-trace is computed recursively, as defined in Eq (7). Here, ρt and ct denote the truncated importance sampling factors. The symbols π and μ refer to the current policy and the actor’s policy, respectively, with μ being a delayed or outdated policy. Additionally, V(st) and C(st) represent the value function and the cost function, respectively, as illustrated in Fig 1.
Based on the V-trace value estimate, we define the policy gradients of the skip and action branches as Eq (8).
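For reference, the standard V-trace target and policy gradient from IMPALA [20], on which Eqs (6)–(8) build, take the following form; LazyAct presumably applies the same correction with the cost-adjusted reward Rt − αCt:

```latex
v_{t} = V(s_{t}) + \delta_{t}V + \gamma\, c_{t}\,\big(v_{t+1} - V(s_{t+1})\big),
\qquad
\delta_{t}V = \rho_{t}\big(R_{t} + \gamma V(s_{t+1}) - V(s_{t})\big),
```
with truncated importance weights $\rho_{t}=\min\!\big(\bar{\rho},\,\pi(a_{t}\mid s_{t})/\mu(a_{t}\mid s_{t})\big)$ and $c_{t}=\min\!\big(\bar{c},\,\pi(a_{t}\mid s_{t})/\mu(a_{t}\mid s_{t})\big)$, and policy gradient $\rho_{t}\,\nabla_{\theta}\log\pi_{\theta}(a_{t}\mid s_{t})\,\big(R_{t}+\gamma v_{t+1}-V(s_{t})\big)$.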
Multiplier optimization.
Drawing from the definition provided in Eq (4), we derive a linear update rule for optimizing α, as outlined in Eq (9). Here, η represents the learning rate for α. This implies that when the skip ratio exceeds the threshold ϵ, α is gradually reduced towards 0, whereas if the skip ratio falls short of ϵ, α is incrementally increased to penalize the insufficient skipping of states.
However, frequent updates of α can destabilize the policy’s advantage function adv, which in turn affects the convergence of LazyAct. To address this issue, we update α only once for every 100 rounds of policy updates (a 1 : 100 multiplier-to-policy update ratio), ensuring a more stable learning process.
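A minimal, self-contained Python sketch of this schedule, assuming a projected linear dual update (the function and variable names are illustrative):

```python
def update_multiplier(alpha, skip_ratio, epsilon, eta):
    """Projected linear dual update: increase alpha when the measured skip ratio
    falls short of the threshold epsilon, and decay it toward 0 otherwise."""
    return max(0.0, alpha + eta * (epsilon - skip_ratio))


# Illustrative schedule: one multiplier update per 100 policy updates.
alpha, epsilon, eta = 0.0, 0.8, 0.01
for step in range(1, 1001):
    # ... one round of (IMPALA) policy optimization would run here ...
    if step % 100 == 0:
        measured_skip_ratio = 0.75  # placeholder for the ratio measured from rollouts
        alpha = update_multiplier(alpha, measured_skip_ratio, epsilon, eta)
```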
LazyAct for multi-agent tasks based on MAPPO
In multi-agent systems, planning, deliberation and execution are the key steps of the decision-making process. This paper primarily focuses on model-free reinforcement learning; model-based algorithms will be investigated in future work. Within the model-free framework, agents lack knowledge of the environment, including its rules and rewards, and learn a policy through trial and error. Planning involves determining how to complete tasks and can be centralized or decentralized. In centralized planning, a central decision-maker plans for all agents, whereas in decentralized planning, each agent independently creates its own action plan. The baseline algorithm in this work, MAPPO, is an example of decentralized planning. Each agent outputs actions based on its observations and receives environmental rewards, continuously adjusting its actions to maximize cumulative reward and ultimately forming the optimal policy π. The planning, deliberation and execution processes are implicitly contained within the policy π, as π has learned the long-term planning for task completion through numerous interactions. LazyAct evaluates the importance of states and determines the skip length: it uses a neural network to decide whether to skip states based on their importance, and if a state is skipped, it generates a series of repeated actions based on the current state and the skip length.
For multi-agent scenarios, we develop LazyAct by building upon the MAPPO algorithm. MAPPO is an extension of the PPO algorithm to multi-agent settings. It employs a framework of centralized learning with decentralized execution. As shown in Fig 2, multiple agents interact independently with the environment and send experience data to a central learner. The central learner uses this data to update the Actor and Critic networks, and then sends the updated Actor network back to each agent. After training, the Critic network is no longer used, and only the Actor network is used for inference. Our skip is a branch appended to the Actor network. The training of the skip branch is integrated with that of the original actions, guided by the Critic network during centralized learning. To refine the evaluation of the advantage function, we integrate the Generalized Advantage Estimator (GAE) [36] into the Critic. The purpose of LazyAct is to reduce the per-agent computational cost in MAPPO by dynamically skipping unimportant states. This mechanism is compatible with the MAPPO framework and can enhance its performance.
As depicted in Fig 2, the MAPPO process is a decentralized execution paradigm where each agent can only observe its own local observation o, without access to the global state s or communication with other agents. Consequently, agents must make decisions based on the partial observation o. Unlike centralized execution, the action space in MAPPO does not increase with the number of agents. Moreover, the multi-agent system can decompose the complex joint action space into several smaller subspaces, allowing each agent to focus on and execute only a subset of actions, thereby simplifying the decision-making process. Regarding skipping, LazyAct can control the proportion of skips better than TempoRL, satisfying specific constraints, such as requiring only a certain proportion of agents to make decisions in each state.
Formal problem definition.
We formulate LazyAct, grounded in the MAPPO framework, as a constrained policy optimization problem, as expressed in Eq (10). Here, o denotes the partial observation of an agent, with j indicating the j-th agent, and N signifies the number of agents in the multi-agent task. The indicator function equals 1 if the j-th agent chooses to skip the policy calculation at state st. The constraint in Eq (10) implies that, for each state, the proportion of agents not engaging in decision-making should exceed the threshold ϵ, thereby diminishing the computational expense associated with processing each state.
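A LaTeX sketch of the constrained objective described for Eq (10), with assumed notation (N agents, skip output $k_{t}^{j}$ of agent j at step t):

```latex
\max_{\theta}\; \mathbb{E}\Big[\sum_{t=0}^{T}\gamma^{t}R_{t}\Big]
\quad \text{s.t.}\quad
\frac{1}{N}\sum_{j=1}^{N}\mathbb{1}\big(k_{t}^{j}>0\big) \;\ge\; \epsilon
\quad \text{for each state } s_{t}.
```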
In the MAPPO variant of LazyAct, the agent predicts both the skip and the action based on its observation o. The computation of the advantage function A(s, o, k, a) takes the global state s into account to facilitate credit assignment among the multiple agents. Analogous to the advantage function in IMPALA, we treat the skip as a cost Ct that is subtracted from the reward Rt, and we employ the Lagrange multiplier α to ensure that the constraints are met. Eq (11) gives the advantage function derived from GAE, where λ is the GAE parameter that balances the weight of current returns against future returns. The term A(s, o, k, a) is equivalent to the GAE estimate gaet.
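One plausible way to write the cost-adjusted GAE of Eq (11), consistent with the description above (the per-agent superscript j and the centralized value V are assumed notation):

```latex
\delta_{t}^{j} = \big(R_{t}-\alpha\,C_{t}^{j}\big) + \gamma V(s_{t+1}) - V(s_{t}),
\qquad
\mathrm{gae}_{t}^{j} = \sum_{l\ge 0}(\gamma\lambda)^{l}\,\delta_{t+l}^{j},
\qquad
A(s,o^{j},k^{j},a^{j}) \equiv \mathrm{gae}_{t}^{j}.
```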
The policy definition is presented in Eq (12), encompassing both skip and action branches. Diverging from the single-agent case, the policy πθ is conditional on the agent’s observation o. Within our framework, all agents utilize a shared neural network structure and parameter set.
Multiplier optimization.
The multiplier controls the influence of the skip cost. We apply a similar linear update rule to α, as shown in Eq (13), where N and T represent the number of agents and the episode length, respectively.
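By analogy with the single-agent rule in Eq (9), Eq (13) can be sketched as a projected linear update on the skip ratio averaged over agents and time steps (the exact form is an assumption):

```latex
\alpha \;\leftarrow\; \max\!\Big(0,\;
\alpha + \eta\Big(\epsilon-\frac{1}{N\,T}\sum_{t=0}^{T}\sum_{j=1}^{N}\mathbb{1}\big(k_{t}^{j}>0\big)\Big)\Big)
```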
Unconstrained pre-training and constrained fine-tuning
Directly training a policy with skip constraints can result in unstable learning and potential policy collapse, particularly in tasks characterized by sparse rewards. To mitigate this, we employ unconstrained pre-training within the LazyAct framework. We initialize the skip ratio threshold ϵ to 0.0 and train the policy network parameters to maximize cumulative reward. After this pre-training phase, we introduce the specified constraints and fine-tune the last layer of the neural network starting from the pre-trained parameters. We then proceed with further training to derive the policy network θ that adheres to the imposed constraints.
The benefit of this approach is that the agent attains a comprehensive understanding of the environment and establishes an initial policy π. Leveraging this policy, the agent can effectively explore various skip options while maintaining the integrity of the policy. However, if the threshold ϵ is set too high, it may still lead to policy collapse: in certain tasks, skipping states can prevent the policy from learning valuable information, so the agent may find that no policy can fulfill the specified constraint. This suggests that different tasks have different ceilings for ϵ. We ascertain this upper limit by incrementally increasing ϵ. Typically, our policy learning is performed according to the ϵ constraint provided by the user. When the policy collapses, the constraint is too strict and it is difficult to learn a policy that satisfies it; in such cases, the user may need to relax the constraint to facilitate successful learning.
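A minimal PyTorch sketch of this two-stage schedule, reusing the illustrative SkipActorNet from the earlier sketch (optimizer choices and hyperparameters are assumptions):

```python
import torch

# Reuses the illustrative SkipActorNet defined in the earlier sketch.
actor = SkipActorNet(obs_dim=64, n_actions=6, max_skip=9)

# Stage 1: unconstrained pre-training (epsilon = 0.0) optimizes all parameters.
pretrain_optimizer = torch.optim.Adam(actor.parameters(), lr=3e-4)

# Stage 2: constrained fine-tuning. Freeze the shared feature extractor and
# optimize only the skip head and the final action layer under the epsilon constraint.
for p in actor.feature.parameters():
    p.requires_grad = False
finetune_params = [p for p in actor.parameters() if p.requires_grad]
finetune_optimizer = torch.optim.Adam(finetune_params, lr=1e-4)
```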
Experiments
Training setups
We have implemented our LazyAct algorithm within the frameworks of IMPALA and MAPPO, utilizing PyTorch 1.10.1. All experimental procedures were conducted on a GPU server equipped with a GTX 1080Ti graphics processing unit and an Intel Xeon Gold 5118 processor, which features 48 cores. To mitigate the impact of random seeds and to obtain robust average outcomes, we conducted three replicate experiments for each algorithm. For single-agent tasks, we performed experimental comparisons across 6 Atari tasks. For multi-agent scenarios, we established 5 distinct tasks within the SMAC environment, as detailed in Table 1.
In the above single-agent tasks, the state is an image, so we use a convolutional neural network (CNN) as the feature network; its structure is shown in Table 2. The “Conv” columns show the filter shape of each convolution, as Channel(Kernel-size). We set the skip range to 0–9.
In the multi-agent tasks, the state is a vector, so we use a fully-connected neural network (FNN) as the feature network; its structure is shown in Table 3. We set the skip range to 0–2. Specifically, the input of the critic contains the observation o, the global state s and the current average skip ratio.
The compared baseline algorithms are listed as follows:
- IMPALA [20]: It is a parallel Actor-Critic training framework that decouples sampling and training to maximize system throughput, and utilizes value correction to compensate for off-policy issues.
- MAPPO [21]: It is an extension of the PPO algorithm to multi-agent systems and is currently one of the most effective algorithms.
- TempoRL [34]: It learns a skip evaluation based on the current state and action, allowing the agent to repeat an action over skipped states.
Scores vs Skip ratios
The comparison in single-agent tasks.
We analyze the training curves of LazyAct, IMPALA, and TempoRL in terms of cumulative reward, as depicted in Fig 3. Subsequently, we examine the training skip ratio curves of LazyAct across various threshold values of ϵ, as illustrated in Fig 4. Additionally, we compare the evolution of the parameter α in LazyAct for different threshold values of ϵ, as shown in Fig 5. The findings indicate that LazyAct surpasses IMPALA in both sample efficiency and policy quality. We delve into the reasons behind these results. In environments with sparse rewards, the ability to skip allows for the rapid feedback of future rewards to the current state, thereby enhancing reward density. Furthermore, TempoRL exhibits superior performance to IMPALA in both score and skip metrics. However, TempoRL’s lack of constraints necessitates the exploration of a broader search space, which slightly detracts from its score and skip outcomes compared to our LazyAct algorithm.
LazyAct starts training from an unconstrained pre-trained model.
We evaluate the trained policies obtained from LazyAct, IMPALA, and TempoRL, examining their cumulative rewards and skip ratios, as presented in Table 4. The values within each cell are denoted as score (skip ratio), with our algorithms LazyAct0.7 and LazyAct0.8 indicating the outcomes when the skip threshold ϵ is set to 0.7 and 0.8, respectively. The data reveal that LazyAct markedly enhances the skip ratio without compromising the score; in fact, its average score surpasses that of both IMPALA and TempoRL. This can be explained by the principle that skipping irrelevant states accelerates the feedback of future rewards to the current state s, particularly in tasks with sparse rewards. Moreover, the skip ratio achieved by LazyAct aligns with the specified threshold ϵ. Additionally, TempoRL is not suitable for tasks that require a predetermined skip ratio due to its lack of control over the skip ratio.
Each cell represents the score(skip ratio).
The comparison in multi-agent tasks.
Given that TempoRL is tailored for single-agent settings and is not directly applicable to multi-agent tasks, we restrict our comparisons to the most robust baseline, MAPPO. We assess the training curves of LazyAct and MAPPO in terms of win rate, as depicted in Fig 6. Subsequently, we examine the training skip ratio curves of LazyAct across various threshold values of ϵ, as illustrated in Fig 7. Furthermore, we compare the evolution of the parameter α in LazyAct for different threshold values of ϵ, as shown in Fig 8.
LazyAct starts training from an unconstrained pre-trained model.
We benchmark the trained policies of LazyAct against those of MAPPO, evaluating their win rates and skip ratios, as detailed in Table 5. The values within each cell are presented as win rate (skip ratio), with our algorithms LazyAct0.2 and LazyAct0.4 indicating the outcomes when the skip threshold ϵ is set to 0.2 and 0.4, respectively. The data demonstrate that LazyAct successfully increases the skip ratio, satisfying the specified constraint threshold ϵ. Concurrently, LazyAct maintains a final win rate comparable to that of MAPPO. With a skip ratio of 0.2, the average win rate decrement does not exceed 2%, and when the skip ratio is 0.4, the average win rate decrement is no more than 2.7%.
Each cell represents the Win rate(skip ratio).
Time and FLOPs saving
In this section, we compare the reduction in time and floating-point operations (FLOPs) of LazyAct against the baseline algorithms. Notably, different algorithms require different numbers of steps to complete tasks, so the computation time and FLOPs are not directly proportional to the skip ratio. Moreover, the added skip branch results in extra computation time and FLOPs. Additionally, computation time was measured on the 1080Ti, where, due to parallel computing, time is not directly proportional to FLOPs. We therefore directly measured the actual runtime of the different algorithms.
Table 6 shows the savings in time and FLOPs of LazyAct compared to baseline algorithms in single-agent tasks. The results indicate that LazyAct significantly reduces both time and FLOPs compared to IMPALA and TempoRL, while maintaining high scores. With a higher skip ratio than TempoRL, LazyAct achieves noticeable reductions in time and FLOPs.
Each cell represents the Time(GFLOPs).
Table 7 shows the savings in time and FLOPs of LazyAct compared to baseline algorithms in multi-agent tasks. The results indicate that LazyAct significantly reduces both time and FLOPs compared to MAPPO, while maintaining high scores.
Each cell represents the Time(KFLOPs).
Toy example
To gain an intuitive understanding of LazyAct’s decision-making process, we visualized its behavior on SMAC 25m in a multi-agent setting. Fig 9 illustrates the number of agents making decisions at each state, with gray cells indicating that an agent runs policy inference and white cells signifying a skip decision. In this particular task, LazyAct employs only approximately half of the agents to compute the policy.
Conclusion
This paper introduces LazyAct, a novel lazy actor approach. Drawing inspiration from human decision-making processes, we aim to reduce the computational overhead of policy evaluation while preserving policy quality by bypassing irrelevant states. To achieve this, we incorporate a skip branch into the actor network, which is responsible for predicting whether to skip certain states. We have formulated an optimization objective that includes skip mechanisms for both the single-agent algorithm IMPALA and the multi-agent algorithm MAPPO, achieving substantial results on Atari and SMAC tasks, respectively. By capitalizing on the redundancy inherent in sequential decision-making, we dynamically skip states to enhance efficiency. In future research, we plan to adapt LazyAct to additional DRL algorithms, making it a versatile and widely applicable module. Moreover, we intend to integrate it with other neural network architectures, such as the Transformer, and to develop new optimization techniques.
Supporting information
S2 File. The data of LazyAct.
In reinforcement learning tasks, data is generated from the environment code.
https://doi.org/10.1371/journal.pone.0318778.s002
(ZIP)
References
- 1. Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, et al. Mastering the game of Go without human knowledge. Nature. 2017;550(7676):354–359. pmid:29052630
- 2. Tang X, Zhong G, Li S, Yang K, Shu K, Cao D, et al. Uncertainty-aware decision-making for autonomous driving at uncontrolled intersections. IEEE Transactions on Intelligent Transportation Systems. 2023 Jun 19.
- 3. Kim G, Baldi P, McAleer S. Language models can solve computer tasks. Advances in Neural Information Processing Systems. 2024 Feb 13;36.
- 4. Ge Y, Hua W, Mei K, Tan J, Xu S, Li Z, et al. Openagi: When llm meets domain experts. Advances in Neural Information Processing Systems. 2024 Feb 13;36.
- 5. He Y, Xiao L. Structured pruning for deep convolutional neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2023 Nov 28.
- 6. Wang Z, Li C, Wang X. Convolutional neural network pruning with structural redundancy reduction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021 (pp. 14913-14922).
- 7. Anwar S, Hwang K, Sung W. Structured pruning of deep convolutional neural networks. ACM Journal on Emerging Technologies in Computing Systems (JETC). 2017 Feb 9;13(3):1–8.
- 8. Yang J, Shen X, Xing J, Tian X, Li H, Deng B, et al. Quantization networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019 (pp. 7308-7316).
- 9. Liang T, Glossner J, Wang L, Shi S, Zhang X. Pruning and quantization for deep neural network acceleration: A survey. Neurocomputing. 2021 Oct 21;461:370–403.
- 10. Xu K, Han L, Tian Y, Yang S, Zhang X. Eq-net: Elastic quantization neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2023 (pp. 1505-1514).
- 11. Gou J, Yu B, Maybank SJ, Tao D. Knowledge distillation: A survey. International Journal of Computer Vision. 2021 Jun;129(6):1789–819.
- 12. Czarnecki WM, Pascanu R, Osindero S, Jayakumar S, Swirszcz G, Jaderberg M. Distilling policy distillation. In The 22nd International Conference on Artificial Intelligence and Statistics 2019 Apr 11 (pp. 1331-1340). PMLR.
- 13. Zhong V, Mu J, Zettlemoyer L, Grefenstette E, Rocktäschel T. Improving policy learning via language dynamics distillation. Advances in Neural Information Processing Systems. 2022 Dec 6;35:12504–15.
- 14. Han Y, Huang G, Song S, Yang L, Wang H, Wang Y. Dynamic neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2021 Oct 6;44(11):7436–56.
- 15. Huang G, Chen D, Li T, Wu F, Van Der Maaten L, Weinberger KQ. Multi-scale dense convolutional networks for efficient prediction. arXiv preprint arXiv:1703.09844. 2017;2(2).
- 16. Huang G, Wang Y, Lv K, Jiang H, Huang W, Qi P, et al. Glance and focus networks for dynamic visual recognition. IEEE transactions on pattern analysis and machine intelligence. 2022 Aug 8;45(4):4605–21.
- 17. Yuan Z, Wu B, Sun G, Liang Z, Zhao S, Bi W. S2dnas: Transforming static CNN model for dynamic inference via neural architecture search. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16 2020 (pp. 175-192). Springer International Publishing.
- 18. Bacon PL, Harb J, Precup D. The option-critic architecture. In Proceedings of the AAAI Conference on Artificial Intelligence 2017 Feb 13 (Vol. 31, No. 1).
- 19. Precup D. Temporal abstraction in reinforcement learning. University of Massachusetts Amherst; 2000.
- 20. Espeholt L, Soyer H, Munos R, Simonyan K, Mnih V, Ward T, et al. Impala: Scalable distributed deep-RL with importance weighted actor-learner architectures. In International Conference on Machine Learning 2018 Jul 3 (pp. 1407-1416). PMLR.
- 21. Lohse O, Pütz N, Hörmann K. Implementing an online scheduling approach for production with multi agent proximal policy optimization (MAPPO). In Advances in Production Management Systems. Artificial Intelligence for Sustainable and Resilient Production Systems: IFIP WG 5.7 International Conference, APMS 2021, Nantes, France, September 5–9, 2021, Proceedings, Part V 2021 (pp. 586-595). Springer International Publishing.
- 22. Gupta S, Agrawal A, Gopalakrishnan K, Narayanan P. Deep learning with limited numerical precision. In International Conference on Machine Learning 2015 Jun 1 (pp. 1737-1746). PMLR.
- 23. Lane ND, Georgiev P. Can deep learning revolutionize mobile sensing? In Proceedings of the 16th International Workshop on Mobile Computing Systems and Applications 2015 Feb 12 (pp. 117-122).
- 24. Han S, Mao H, Dally WJ. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149. 2015 Oct 1.
- 25. Anwar S, Hwang K, Sung W. Fixed point optimization of deep convolutional neural networks for object recognition. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2015 Apr 19 (pp. 1131-1135). IEEE.
- 26. Lee J, Kim D, Ham B. Network quantization with element-wise gradient scaling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021 (pp. 6448-6457).
- 27. Peters J, Fournarakis M, Nagel M, Van Baalen M, Blankevoort T. QBitOpt: Fast and accurate bitwidth reallocation during training. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2023 (pp. 1282-1291).
- 28. Cheng AC, Lin CH, Juan DC, et al. InstaNAS: Instance-aware neural architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence 2020 (Vol. 34, No. 04, pp. 3577-3584).
- 29. Cui W, Han Z, Ouyang L, Wang Y, Zheng N, Ma L, et al. Optimizing dynamic neural networks with Brainstorm. In 17th USENIX Symposium on Operating Systems Design and Implementation 2023 (pp. 797-815).
- 30. Braylan A, Hollenbeck M, Meyerson E, Miikkulainen R. Frame skip is a powerful parameter for learning to play Atari. In Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence 2015 Apr 1.
- 31. Khan A, Feng J, Liu S, Asghar MZ. Optimal Skipping Rates: Training Agents with Fine‐Grained Control Using Deep Reinforcement Learning. Journal of Robotics. 2019;2019(1):2970408.
- 32. Metelli AM, Mazzolini F, Bisi L, Sabbioni L, Restelli M. Control frequency adaptation via action persistence in batch reinforcement learning. In International Conference on Machine Learning 2020 Nov 21 (pp. 6862-6873). PMLR.
- 33. Sharma S, Lakshminarayanan AS, Ravindran B. Learning to repeat: Fine grained action repetition for deep reinforcement learning. In International Conference on Learning Representations 2016 Nov 4.
- 34. Biedenkapp A, Rajan R, Hutter F, Lindauer M. TempoRL: Learning when to act. In International Conference on Machine Learning 2021 Jul 1 (pp. 914-924). PMLR.
- 35. Sabbioni L, Al Daire L, Bisi L, Metelli AM, Restelli M. Simultaneously updating all persistence values in reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence 2023 Jun 26 (Vol. 37, No. 8, pp. 9668-9676).
- 36. Schulman J, Moritz P, Levine S, Jordan M, Abbeel P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438. 2015 Jun 8.