
Intelligent penetration testing method for power internet of things systems combining ontology knowledge and reinforcement learning

  • Shoudao Sun ,

    Roles Data curation, Funding acquisition, Investigation, Methodology, Project administration

    leoooh@163.com

    Affiliation State Grid Liaoning Electric Power Co., Ltd., Shenyang Power Supply Company, Shenyang, China

  • Yi Lu,

    Roles Conceptualization, Formal analysis, Investigation

    Affiliation State Grid Liaoning Electric Power Co., Ltd., Shenyang Power Supply Company, Shenyang, China

  • Di Wu,

    Roles Visualization, Writing – review & editing

    Affiliation State Grid Liaoning Electric Power Co., Ltd., Shenyang Power Supply Company, Shenyang, China

  • Guangyan Zhang

    Roles Software, Validation, Writing – original draft

    Affiliation State Grid Liaoning Electric Power Co., Ltd., Shenyang Power Supply Company, Shenyang, China

Abstract

With the application of new-generation information technologies such as big data, artificial intelligence, and the energy Internet in Power Internet of Things (IoT) systems, a large number of IoT terminals, acquisition terminals, and transmission devices have achieved integrated interconnection and comprehensive information interaction. However, this transformation also brings new challenges: the security risk of intrusions into power IoT systems has significantly increased, making the assurance of power system information security a research hotspot. Penetration testing, as an essential means of information security protection, is critical for identifying and fixing security vulnerabilities. Given the complexity of power IoT systems and the limitations of traditional manual testing methods, this paper proposes an automated penetration testing method that combines prior knowledge with deep reinforcement learning. It aims to intelligently explore optimal attack paths under conditions where the system state is unknown. By constructing an ontology knowledge model to fully utilize prior knowledge and introducing an attention mechanism to address the issue of varying state spaces, the efficiency of penetration testing can be improved. Experimental results show that the proposed method effectively optimizes path decision-making for penetration testing, providing support for the security protection of power IoT systems.

Introduction

The Power Internet of Things (IoT) is the application of IoT technology in smart grids. It effectively integrates power system infrastructure resources by combining various information sensing devices with existing network and database technologies, forming a large intelligent network among electrical devices and between devices and personnel [1]. The rapid development of new-generation information technologies, such as big data, artificial intelligence, and the energy Internet, has propelled the scale of Power IoT information systems into a stage of explosive growth [2]. Leveraging a vast number of IoT terminals, acquisition terminals, transmission devices, etc., it has achieved the integration and interconnection of various types of energy systems, spatiotemporal information, and business information, as well as the perception and information exchange throughout the processes of energy production, transmission, storage, trading, and consumption [3]. Meanwhile, some uncontrollable factors and changes in the physical contact environment have significantly increased the risk of intrusion into power IoT systems. Ensuring the information security of power systems has become a hot research direction in recent years [4, 5].

Power IoT tightly integrates the information systems and physical systems of the power sector, creating an open and interconnected system that significantly increases the information security risks of the entire power system [6]. For instance, attackers can disrupt normal system operations through ransomware attacks, denial-of-service attacks, phishing attacks, malware attacks, zero-day vulnerabilities, and other methods, potentially leading to downtime, equipment damage, triggering emergency responses, and other security incidents. Therefore, to guard against these threats, the power sector needs to formulate and deploy comprehensive security protection measures. Among these, penetration testing is a crucial method for security protection. It helps in identifying and fixing security vulnerabilities, thereby protecting information and business operations from risks such as hacker attacks and data breaches, reducing the likelihood of security incidents [7]. According to MarketsandMarkets [8], the global penetration testing market is estimated to be worth $1.7 billion in 2024 and is projected to reach $3.9 billion by 2029 at a compound annual growth rate of 17.1% during the forecast period.

During penetration tests, testers employ various attack methods and tools to attempt breaking through the system’s security defenses, thereby identifying and exploiting potential security flaws [6]. Due to the characteristics of vulnerabilities such as their concealment, dynamism, and dependency, penetration testers must not only comprehend the nature of these vulnerabilities within power grid interconnected information systems but also leverage a variety of testing tools to conduct efficient and thorough penetration testing. Therefore, this role is often filled by experts with extensive development experience and comprehensive technical knowledge. The complexity of the power IoT poses significant challenges for manually discovering hidden vulnerabilities, analyzing the attack surfaces, and designing penetration testing schemes [9]. For instance, in one power system mini-program, several rounds of manual packet capturing and data modification were required to uncover an SQL injection vulnerability that could lead to massive data leakage from the database. This highly expert-dependent and labor-intensive testing approach increasingly fails to meet the growing needs of vulnerability penetration testing in the rapid development of the power IoT.

Currently, there are relatively few tools for intelligent penetration testing path planning for power IoT systems. The primary reason for this gap lies in the lack of effective accumulation and integration of security information specific to power IoT systems. Specifically, while a wealth of useful security information exists—such as vulnerability databases, threat intelligence databases, and security knowledge bases—this knowledge is not fully utilized during penetration testing. On the other hand, reinforcement learning (RL) agents possess the capability for autonomous learning, adjusting and optimizing their strategies based on feedback signals from the environment to find optimal policies. Leveraging RL technology for automated penetration testing of power IoT systems can enable intelligent exploration of target networks, exploitation of potential system vulnerabilities, and automatic generation of optimal attack paths. However, current research on RL-based penetration testing technology primarily focuses on modeling the testing environment and training agents to execute optimal attack path planning strategies. There is relatively less attention given to issues faced during actual penetration testing processes, such as dynamic changes in the state space, incorporation of prior knowledge, and the execution efficiency of models.

To address these challenges, this paper first constructs an ontology knowledge model that describes the state of the system during the penetration testing process, based on prior knowledge about the topology, assets, components, and vulnerabilities of power IoT systems, forming a system state space matrix. We introduce RL theory to iteratively optimize the penetration testing path, exploring optimal path decision-making methods for penetration testing in power IoT systems. Finally, using a distribution automation system as a prototype, we establish a simulation test range to conduct tests, validations, and comparative analyses of the proposed method. The main innovations of this paper are summarized as follows:

1) A penetration testing method for power IoT systems based on deep RL is proposed, which can automatically learn and execute penetration paths even when the system state is unknown;

2) An ontology-based state matrix construction method for the training process of Deep Q-Networks (DQNs) is proposed, leveraging system topology, assets, components, and vulnerability prior knowledge to fully utilize system and security knowledge, thereby avoiding blind exploration behavior during DQN learning;

3) An attention mechanism is adopted to address the issue of changing input state spaces in neural networks, and an optimal path decision-making method for penetration testing is proposed based on the Rainbow DQN framework. The experiment demonstrates that the proposed method can achieve faster convergence with fewer iteration steps across different scenarios.

The subsequent contents of this paper are organized as follows: The Related Works section reviews the current development status of penetration testing technology; The Preliminary section introduces the concepts and theories of deep RL; The Methodology section elaborates on the penetration testing method based on deep RL; The Experiment section describes the experimental methods and results; and the Conclusion section summarizes the entire paper and provides future outlook.

Related works

Traditional penetration testing primarily relies on manual testing, which must adhere to given security testing methodologies and execution standards. Testers leverage security knowledge bases and the experience of experts to conduct relevant testing work. With the development of artificial intelligence (AI) technology, the processes and methods of penetration testing have become increasingly automated and intelligent. According to a survey by Haq et al. [10], as of 2021, using keywords such as “Penetration Testing”, “Pentest” and “Android Penetration”, 1040 relevant publications were identified in databases like IEEE Xplore, ACM Digital Library, and Springer Link, with 380 of these focusing on mobile device penetration testing. Additionally, searching the Engineering Village and Web of Science databases using the keywords “Penetration Testing” and “Security” revealed that over 100 related papers have been published annually since 2019 (based on preliminary screening results). Therefore, penetration testing, as an essential component of information security protection, has remained a key focus of research for security professionals.

Early automatic penetration testing adopted rule-based methods, where a series of actions required for a specific attack tactic are abstracted and then integrated into penetration tools [11]. Common tools include Nmap and Nessus. According to the type of rules, automatic penetration testing can be divided into single-rule and multi-rule categories. Single-rule methods focus on executing a particular attack tactic for a specific task to assist penetration testers in automating targeted penetration tasks but heavily rely on parameter configuration. Typical examples include DCShadow and Atbroker. Multi-rule methods integrate multiple attack tactics to achieve combinational penetration testing, increasing the degree of automation in the testing process [12]. However, the coordination between multiple attack tactics remains relatively low, with the penetration process still depending on parameter configuration. Representative tools include Kaboom, Metasploit, Cobalt Strike, and so on.

The extraction of penetration rules is the most critical requirement in rule-based methods. Depending on the extraction method, it can be categorized into rule extraction based on penetration experience, threat intelligence, and intrusion detection. Reference [13] proposes a semi-automatic knowledge extraction method based on ontology that combines security concepts with security technologies to form a penetration testing knowledge base, enhancing the automation level of penetration testing. Zhou et al. used natural language processing techniques to gather intelligence information on advanced persistent threats, combined with regular expressions to extract threat indicators, thereby constructing a penetration testing method based on the ATT&CK matrix [14]. Reference [15] describes an application method for penetration behavior extraction using a security knowledge graph, building a security knowledge graph from audit logs and extracting attack behavior instances using TransE. Rule-based penetration testing relies on rules formed through manual experience and target scenario analysis. While this approach adapts well to the target environment, it heavily depends on expertise and the rule library, making it difficult for researchers outside this specialized field to apply or generalize.

To enhance the intelligence of automated penetration testing, as well as its flexibility and adaptability in complex environments, thereby better serving various testers in need, researchers have recently begun to explore methods that integrate AI technology with penetration testing. These efforts have led to the development of model-based approaches [16]. Reinforcement Learning (RL) based penetration testing is a typical method. It abstracts the penetration testing process into a sequential decision-making process, where the state of the test subject is taken as input, penetration testing actions as output, and the execution results of actions serve as rewards. Through the interaction between the agent and the environment, the agent is trained to converge on the optimal strategy, thereby achieving penetration testing of the environment.

Some RL-based algorithms model the uncertain transition relations of penetration testing states as a partially observable Markov decision process (POMDP) or a Markov Decision Process (MDP), and train through interactions with the environment to develop attack path planning strategies. Sarraute et al. [17] pioneered the decomposition of network structures into host-level POMDPs and integrated information gathering into the penetration testing workflow, aligning it with real-world scenarios. Subsequent work by Shmaryahu et al. [18] addressed scalability limitations through episodic planning for partial observability. Schwartz et al. [19] enhanced realism by modeling defenders’ active responses with information decay factors. POMDP performs excellently in simulating the uncertainty of penetration testing, but its solution complexity becomes significantly high in complex scenarios, making it difficult to apply in large-scale network environments. As a result, an increasing number of researchers have started modeling the penetration testing process as an MDP, which has the advantage of reducing model complexity and improving decision-making efficiency. Yousefi et al. [20] used MulVal [21] to generate attack graphs and applied attack graph matrices for MDP modeling. Zennaro et al. [22] focused on simplified penetration testing network attack and defense competitions (Capture The Flag) and modeled it as an MDP, solving the problem using a table-based Q-learning algorithm. To mitigate the issue of state and action space explosion, they proposed a method that uses imitation learning to provide prior knowledge for the agent. These methods have successfully cast the penetration testing (PT) problem into the RL paradigm. However, they struggle to mimic complex realistic environments and capture effective features, and extending the observation space to model the environment further exacerbates computational complexity, limiting fitting performance and speed.

The emergence of DQNs introduced Deep Neural Networks (DNNs) as function approximators, enabling the agent to perceive the environment more efficiently. Zhou et al. [23] introduced network information gain as reward signals to guide exploration. Hu et al. [24] and Nguyen et al. [25] optimized action spaces through graph simplification and multi-level embedding. Sultana et al. [26] systematically evaluated deep RL stability across network topologies, revealing critical generalization challenges. These efforts culminated in Zhou et al.’s NDSPI-DQN framework [27], which integrated five DQN extensions with action decoupling to reduce dimensionality. Despite progress, unresolved issues in dynamic environment adaptation and invalid action filtering prompted investigations into knowledge-enhanced architectures. Random and ineffective exploration in early training limits the model’s convergence efficiency and generalization, especially in dynamic and complex environments.

Table 1 provides a comparative analysis of the strengths and weaknesses of rule-based and DQN-based approaches. Rule-based methods rely on predefined logic and features, offering strong interpretability, low false positive rates, and fast execution speed. However, their generalizability is limited, as they can only cover predefined scenarios, and their automation level is relatively low. In contrast, DQN-based methods leverage DNNs to perceive the environment, enabling dynamic adaptation to similar scenarios and fully automated testing, significantly enhancing flexibility and adaptability. However, since these methods rely on data-driven decision-making, they suffer from higher false positive rates, operate as a “black box” with limited interpretability, and exhibit lower execution efficiency during the exploration phase.

Table 1. Comparison of the strengths and weaknesses of different penetration testing techniques.

https://doi.org/10.1371/journal.pone.0323357.t001

To address the issues faced by traditional DQN, recent studies emphasized embedding expert knowledge to improve learning efficiency. Zennaro et al. [22] simplified penetration testing as MDPs with imitation learning for knowledge injection. Li et al. [28] introduced expert prior knowledge to alleviate ineffective exploration and improved learning speed through hierarchical learning. Sychugov et al. [29] further explored adversarial inverse reinforcement learning for dynamic networks, introducing tools like “Deep Exploit” for collecting expert data. In [30], the proposed framework encompassed the collection and utilization of expert knowledge. In the pretraining phase, the replay buffer primarily incorporates expert knowledge to prevent ineffective exploration. During formal training, the proportion of experience gained through the agent’s exploration in the replay buffer is gradually increased. This approach accelerates model convergence while ensuring generalization capability. Although incorporating prior knowledge into DQN has demonstrated its potential in automated penetration testing, related research remains limited. Moreover, most prior knowledge-based studies have been conducted in known, static testing environments with relatively fixed state spaces. In practical scenarios, an agent may not have complete prior knowledge of the network topology, host assets, and vulnerability information, leading to potential dynamic changes in the state space. This dynamic nature conflicts with the requirement in DQN that the input state space maintains a fixed dimensionality within the neural network structure.

In summary, the automatic and intelligent execution of penetration testing has attracted considerable attention from researchers, leading to numerous theoretical and practical achievements. However, current results still present significant potential for optimization. For instance, how to address state-space changes caused by dynamic or unknown environments, how to enhance DQN techniques to reduce iteration steps for quicker convergence, and whether DQN can handle test environments of different scales while solving the aforementioned issues are topics that require further investigation. Therefore, the subsequent discussion in this paper will delve into these matters.

Preliminary

Reinforcement learning

RL is a variant of machine learning that allows an agent to adjust its behavior by continuously perceiving the state of the environment and the rewards for actions, thereby finding the optimal action strategy in a given scenario [31]. For a standard RL framework, at each time step t, the RL agent observes the state s_t from the environment and selects an action a_t according to a policy π in the set of policies Π. The agent then receives a reward signal r_t from the environment and transitions to the next state s_{t+1}. The goal of the agent is to find the optimal policy π* that maximizes the expected cumulative reward:

π* = argmax_π E_{τ∼p_π(τ)} [ Σ_{t=0}^∞ γ^t r(s_t, a_t) ]  (1)

where γ ∈ [0, 1] is the discount factor determining the importance of future rewards, Π is the set of policies, τ represents the trajectory (s_0, a_0, s_1, a_1, …), and p_π(τ) denotes the distribution of trajectories under policy π. Q-Learning is a commonly used technique for finding the optimal policy [32]. For Q-Learning, the Q-function Q(s_t, a_t) represents the expected return for taking action a_t in state s_t. Its update rule is given by:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ max_{a′} Q(s_{t+1}, a′) − Q(s_t, a_t) ]  (2)

where α is the learning rate. However, traditional Q-Learning uses a table to store the Q-value of every state–action pair (s_t, a_t), which poses significant limitations in practice and cannot handle large state spaces. To address these issues, researchers introduced neural networks as function approximators for state–action pairs, leading to deep RL algorithms such as DQN. The training process of the DQN algorithm is illustrated in Fig 1. It introduces two separate networks, the Eval Network and the Target Network, for generating predicted Q-values and target Q-values, respectively. Additionally, it incorporates Experience Replay to improve the efficiency of experience utilization.
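To make the tabular update rule concrete, the following minimal sketch applies one Q-learning step (the state/action sizes, reward, and hyperparameters are purely illustrative):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: move Q(s, a) toward the
    TD target r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy table: 3 states, 2 actions, initialized to zero.
Q = np.zeros((3, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
print(Q[0, 1])  # 0.1 = alpha * (1.0 + gamma * 0 - 0)
```

Replacing the table `Q` with a neural network that maps states to action values is precisely the step that leads to DQN.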

The loss function for DQN is defined as:

L_i(θ_i) = E_{(s,a,r,s′)} [ ( r + γ max_{a′} Q(s′, a′; θ_i⁻) − Q(s, a; θ_i) )² ]  (3)

where θ_i represents the neural network parameters at the i-th iteration, θ_i⁻ is the set of parameters for the Target Network, and Q(s′, a′; θ_i⁻) denotes the Q-value from the Target Network at the i-th iteration.

Rainbow DQN

Rainbow DQN is a comprehensive algorithm that integrates multiple improvements to DQN, first proposed by the DeepMind team in 2018 [33]. It primarily incorporates the following six enhancement strategies:

1) Double DQN: By using two networks (target network and behavior network) to estimate the maximum Q-value and select actions separately, it mitigates the overestimation problem of Q-values in DQN [34]. The Q-value calculation for the Target Network in Double DQN is given by:

y_t = r_t + γ Q( s_{t+1}, argmax_a Q(s_{t+1}, a; θ); θ⁻ )  (4)

where argmax_a Q(s_{t+1}, a; θ) indicates selecting the best action in the next state using the behavior network with parameters θ, and Q(·, ·; θ⁻) evaluates the value of the selected action using the target network with parameters θ⁻.
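A minimal sketch of this decoupled target computation (the toy Q-vectors below are illustrative, not outputs of a real network):

```python
import numpy as np

def double_dqn_target(r, q_next_online, q_next_target, gamma=0.99, done=False):
    """Double DQN target: the online (behavior) network selects the action,
    the target network evaluates it."""
    if done:
        return r
    a_star = int(np.argmax(q_next_online))    # selection by the behavior network
    return r + gamma * q_next_target[a_star]  # evaluation by the target network

q_online = np.array([1.0, 3.0, 2.0])  # behavior net prefers action 1
q_target = np.array([0.5, 1.0, 4.0])  # target net values action 1 at 1.0
y = double_dqn_target(r=1.0, q_next_online=q_online, q_next_target=q_target, gamma=0.9)
print(y)  # 1.0 + 0.9 * 1.0 = 1.9
```

Note that a vanilla DQN target would instead use max(q_target) = 4.0 here, which illustrates the overestimation that the decoupling avoids.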

2) Dueling DQN: This architecture splits the network into two streams to decompose the Q-value into a value stream V(s) and an advantage stream A(s, a), better separating state value from action value [35]. The Q-value is computed as:

Q(s, a) = V(s) + ( A(s, a) − (1/|A|) Σ_{a′} A(s, a′) )  (5)
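The mean-subtracted aggregation can be sketched as follows (the scalar value and advantage numbers are illustrative):

```python
import numpy as np

def dueling_q(value, advantages):
    """Combine the value and advantage streams; subtracting the mean
    advantage keeps V(s) and A(s, a) identifiable."""
    advantages = np.asarray(advantages, dtype=float)
    return value + (advantages - advantages.mean())

q = dueling_q(value=2.0, advantages=[1.0, -1.0, 0.0])
print(q)  # Q-values [3.0, 1.0, 2.0]: V(s) shifts all actions, A orders them
```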

3) Prioritized Experience Replay (PER): Samples are drawn based on their importance (such as the temporal difference error (TD-error)) from the experience replay buffer, prioritizing learning from important samples to accelerate the learning process [36]. The TD-error δ_i measures the difference between the current predicted Q-value and the target Q-value y_i; a small constant ε is then added to ensure a non-zero priority, p_i = |δ_i| + ε. To sample from the experience replay buffer, each experience’s sampling probability P(i) is adjusted according to its priority raised to the power α:

P(i) = p_i^α / Σ_k p_k^α  (6)
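Under these definitions, proportional prioritization can be sketched as follows (α = 0.6 and ε = 0.01 are common defaults, assumed here):

```python
import numpy as np

def per_probabilities(td_errors, alpha=0.6, eps=1e-2):
    """Proportional prioritization: p_i = |delta_i| + eps,
    P(i) = p_i**alpha / sum_k p_k**alpha."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    return priorities / priorities.sum()

probs = per_probabilities(np.array([0.0, 1.0, 4.0]))
print(probs.argmax())  # 2 — the transition with the largest TD-error
```

The exponent α interpolates between uniform sampling (α = 0) and fully greedy prioritization (α = 1).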

4) Multi-step Learning: By introducing n-step returns, it considers the cumulative discounted reward over the next n time steps, thus capturing long-term dependencies better [37]. The n-step return is defined as the sum of immediate rewards from the current step t to t + n−1, plus the maximum expected reward at step t + n:

R_t^{(n)} = Σ_{k=0}^{n−1} γ^k r_{t+k} + γ^n max_{a′} Q(s_{t+n}, a′)  (7)
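A sketch of the n-step return under this definition (the rewards and bootstrap value are illustrative):

```python
def n_step_return(rewards, q_bootstrap, gamma=0.9):
    """n-step return: discounted sum of the next n rewards plus the
    discounted bootstrap value max_a Q(s_{t+n}, a)."""
    n = len(rewards)
    g = sum(gamma ** k * r for k, r in enumerate(rewards))
    return g + gamma ** n * q_bootstrap

# 3-step return with rewards [1, 0, 1] and bootstrap value 2.0:
g = n_step_return([1.0, 0.0, 1.0], q_bootstrap=2.0, gamma=0.9)
print(round(g, 3))  # 3.268 = 1 + 0 + 0.81 + 0.729 * 2
```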

5) Noisy Networks: By introducing parameterized noise sources into network weights, it enables the network to exhibit exploratory behavior naturally across different states [38]. For each weight in the neural network, the expression with added noise is:

w = μ + σ ⊙ ε  (8)

where μ and σ represent the mean and standard deviation of the weight, and ε is a noise vector following a certain distribution.
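A minimal sketch of one noisy weight draw (independent Gaussian noise is assumed here; Rainbow in fact uses factorized Gaussian noise for efficiency):

```python
import numpy as np

def noisy_weights(mu, sigma, rng):
    """Noisy-network weight: w = mu + sigma * eps, with eps ~ N(0, 1).
    mu and sigma are learned; only eps is resampled each forward pass."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.zeros((2, 3))
sigma = np.full((2, 3), 0.5)
w = noisy_weights(mu, sigma, rng)
print(w.shape)  # (2, 3) — same shape as the deterministic weight matrix
```

As σ is learned toward zero, the layer gradually becomes deterministic, so exploration is annealed automatically per state rather than by an ε-greedy schedule.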

6) Distributional RL: Instead of estimating a single Q-value, it aims to learn a probability distribution describing all possible cumulative discounted rewards, better capturing uncertainty. Categorical DQN is one implementation of this approach [39]. Given the current experience (s, a, r, s′), the goal is to find a new target distribution: the support z is shifted by the Bellman operator, Tz = r + γz, and the new distribution m is then computed by projecting the shifted atoms Tz back onto the fixed support. The loss function typically uses cross-entropy loss:

L(θ) = − Σ_i m_i log p_i(s, a; θ)  (9)

where m_i is the projected target probability of atom i and p_i(s, a; θ) is the predicted probability.
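The shift-and-project step can be sketched as follows (a C51-style projection; the support bounds, atom count, and input distribution are illustrative):

```python
import numpy as np

def categorical_projection(z, probs, r, gamma, v_min, v_max):
    """C51-style projection: shift the support to Tz = r + gamma*z, clip it to
    [v_min, v_max], and split each atom's mass between its two nearest atoms."""
    n = len(z)
    dz = (v_max - v_min) / (n - 1)
    tz = np.clip(r + gamma * z, v_min, v_max)
    b = (tz - v_min) / dz                      # fractional index on the support
    lo = np.floor(b).astype(int)
    hi = np.ceil(b).astype(int)
    m = np.zeros(n)
    for j in range(n):
        if lo[j] == hi[j]:                     # Tz landed exactly on an atom
            m[lo[j]] += probs[j]
        else:                                  # distribute mass proportionally
            m[lo[j]] += probs[j] * (hi[j] - b[j])
            m[hi[j]] += probs[j] * (b[j] - lo[j])
    return m

z = np.linspace(-1.0, 1.0, 5)                  # 5 atoms on [-1, 1]
m = categorical_projection(z, np.full(5, 0.2), r=0.5, gamma=0.9,
                           v_min=-1.0, v_max=1.0)
print(round(m.sum(), 6))  # 1.0 — the projection preserves probability mass
```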

Methodology

Problem formulation

The problem of automated penetration testing based on DQN can be modeled as an MDP [40], typically represented by the tuple (S, A, R, P, γ), where: S represents the state space of the environment, A is the action space, R denotes the reward function, P is the state transition probability function P(s_{t+1} | s_t, a_t), indicating the probability that executing an attack will transfer the state from s_t to s_{t+1}, and γ is the discount factor used to determine the importance of long-term rewards.

1) State Space The information acquired by the agent during the penetration process through scanning the environment can be expressed as a state matrix S. Each row M_i of this matrix represents the status of host node i in the network and its connections with other hosts, defined as follows:

M_i = [ Con(M_i), Ass(M_i), Attr(M_i) ]  (10)

where Con(M_i) = [M_{i,1}, …, M_{i,N}] indicates the connection relationship between host node i and other nodes in the network, N is the network size; Ass(M_i) and Attr(M_i) represent the asset information and vulnerability information of host node i, respectively. Thus, the entire test environment’s state space is S = [M_1; M_2; …; M_N]. The subsequent section will detail the method for constructing the state matrix S.

2) Action Space The actions that the agent can execute during the penetration test include a set of behaviors such as collecting asset information (e.g., port scanning, service scanning, operating system (OS) detection, system hardware/software identification), vulnerability scanning (e.g., OS vulnerability scanning, service vulnerability scanning, hardware/software vulnerability probing), and exploitation (specific services based on discovered vulnerabilities). Each action in the action space can be described as a specific attack behavior or tool. It should be noted that the dimensionality of the action space may change as new vulnerabilities are exposed and penetration techniques evolve, leading to an increase in attack tools. If the action space changes, the trained DQN parameters may need to be updated or adjusted, but this is beyond the scope of this paper.

3) Reward Function The goal of the agent is to gather additional information about target hosts at minimal attack cost, thereby increasing the success rate of exploiting vulnerabilities to attack the host. Therefore, the Reward Function r(i) for any host node i in the environment can be expressed as:

r(i) = Val(i) · Vul(i) − Cost(a)  (11)

where Val(i) represents the value of host node i; Vul(i) is the vulnerability score of host node i, which can be calculated using the Common Vulnerability Scoring System (CVSS) formula [41] and the vulnerability information of host node i; Cost(a) represents the cost of executing attack a, considering factors like execution time and resource consumption.
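As an illustrative sketch of such a reward (the multiplicative combination of node value, normalized CVSS score, and attack cost is an assumption for illustration, not necessarily the paper’s exact formula):

```python
def node_reward(node_value, cvss_score, attack_cost):
    """Illustrative reward: scale the node's value by its normalized
    CVSS base score (0-10) and subtract the cost of the attack action.
    The combination rule here is an assumption, not the paper's formula."""
    return node_value * (cvss_score / 10.0) - attack_cost

# A high-value node with a critical vulnerability (CVSS 9.8) and a cheap exploit:
r = node_reward(node_value=100.0, cvss_score=9.8, attack_cost=5.0)
print(r)  # ≈ 93.0
```

Costly actions against low-value, low-severity nodes thus yield negative reward, steering the agent toward cheap, high-impact attack paths.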

4) Optimization Objective According to Equation (1), the optimization objective for the neural network is to maximize the cumulative reward through executing policy π. Therefore, the value function for attack behavior, or Q-function, can be expressed as:

Q^π(s, a) = E_π [ Σ_{k=0}^∞ γ^k r_{t+k} | s_t = s, a_t = a ]  (12)

Thus, for a given state s, the optimal attack action a* can be obtained as:

a* = argmax_a Q(s, a)  (13)

In the current state, the system evaluates all possible actions in the action space and their corresponding Q-values to determine which action yields the highest reward. In this selection process, the exploitation of certain vulnerabilities may result in higher Q-values, making them more likely to be selected under the effect of the argmax operation.

The ontologies of system and security knowledge

Compared to allowing the agent to explore blindly, integrating prior knowledge into DQN can significantly enhance the learning capabilities of the agent. Therefore, in this paper, we propose a method that embeds system prior knowledge and security prior knowledge into the DQN learning process, as illustrated in Fig 2.

Firstly, based on the topology of the power IoT (using a distribution automation system as an example in this paper), a node connection model is established. The connections between nodes are then vectorized (Node2Vector), resulting in a relationship matrix M ∈ ℝ^{N×N}, where M_{i,j} indicates whether there is a connection between node i and node j: 1 for connected, −1 for not connected, and N represents the network scale. Subsequently, each node in the test environment is instantiated according to the ontology knowledge base. The ontology knowledge includes asset attributes of the nodes and vulnerability information. Asset attributes encompass hardware, OS, software, ports, services, etc., while vulnerability information is obtained from the Common Vulnerabilities and Exposures (CVE) vulnerability database and the CVSS vulnerability scoring system, providing data on the complexity, exploitability, and associated security risks of vulnerabilities. The asset attributes and vulnerability attributes of the nodes are then vectorized (Asset2Vector and Vul2Vector), yielding asset attribute vectors Ass(M_i) and vulnerability attribute vectors Attr(M_i). Finally, following the approach depicted in Fig 2, the state matrix S is constructed.

It is evident that as penetration testing proceeds, the host information, asset information, and vulnerability data of the tested system will be continuously uncovered, leading to an increasingly enriched state matrix. However, it is noteworthy that during penetration testing based on DQN, the agent can only select actions within its action space. Therefore, if there are no corresponding actions available for newly discovered host asset information and vulnerability data, the agent cannot exploit these data. Consequently, we can predefine a fixed vector length for asset information and vulnerability data according to the agent’s action space, ensuring that the dimensions of Ass(M_i) and Attr(M_i) remain fixed (entries in the vectors may temporarily be null values). This approach allows us to focus solely on the processing issues of newly discovered hosts (nodes), which will be elaborated in the following sections.
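A minimal sketch of this fixed-dimension row construction (the slot counts and the zero-padding convention are illustrative assumptions):

```python
import numpy as np

ASSET_DIM = 4  # fixed slots for asset attributes (e.g., OS, port, service, hardware)
VULN_DIM = 3   # fixed slots for vulnerability attributes

def node_row(connections, assets, vulns):
    """Build one fixed-length state-matrix row [Con | Ass | Attr].
    Undiscovered attributes are zero-padded so the row dimension never changes."""
    ass = np.zeros(ASSET_DIM)
    ass[:len(assets)] = assets
    attr = np.zeros(VULN_DIM)
    attr[:len(vulns)] = vulns
    return np.concatenate([connections, ass, attr])

# Node connected to host 1 only (of 3 hosts), two known assets, one known vuln:
row = node_row(np.array([-1.0, 1.0, -1.0]), assets=[2.0, 5.0], vulns=[7.5])
print(row.shape)  # (10,) = 3 connections + 4 asset slots + 3 vuln slots
```

Only the connection segment grows with the number of discovered hosts, which is exactly the part the attention mechanism in the next section is introduced to handle.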

Rainbow DQN-based modeling method

As mentioned above, the state matrix S grows with the number of host nodes explored during the penetration testing process. Consequently, the input dimensions for the DQN network change dynamically. However, a typical DQN employs fully connected layers, which require the number of neurons in each layer to be fixed. To address this issue, this paper introduces an attention mechanism to adapt to the dynamic changes in the state matrix S, as illustrated in Fig 3.

Fig 3. The rainbow DQN-based penetration testing framework.

https://doi.org/10.1371/journal.pone.0323357.g003

We decompose the state matrix S into two parts. The first part consists of the asset attributes and vulnerability attributes of the host nodes. Since the format of these attributes is fixed, even if a host node lacks a particular attribute, we retain the framework and fill the corresponding section with zeros. Therefore, this part has a fixed dimension, containing the elements Ass(M_i) and Attr(M_i). The second part comprises the connection relationships between host nodes. As the agent explores the network, the number of discovered host nodes increases continuously. For this reason, an attention mechanism is applied to learn from this process. It computes a weighted sum of the relationships to obtain a representation c with a fixed dimension:

$c = \sum_{i=1}^{n} \alpha_i x_i$ (14)

where $x_i$ represents the connection relationship of host node $i$ and $\alpha_i$ denotes its attention weight. The weight factors are computed with the softmax function [42]:

$\alpha_i = \dfrac{\exp(e_i)}{\sum_{j=1}^{n} \exp(e_j)}$ (15)

$e_i = (W_K x_i)^{\top} (W_Q \bar{x})$ (16)

where $W_K$ and $W_Q$ are automatically learned parameters, with $W_K$ mapping each connection vector to the Key of the network and $W_Q$ mapping the pooled state $\bar{x}$ to the Query. Through this approach, regardless of the number of host nodes discovered by the agent, the parameters that the DQN network needs to learn remain constant. This not only simplifies the training process but also enhances the agent’s adaptability to dynamic changes in the environment.
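A minimal NumPy sketch of this attention pooling follows. The projection matrices stand in for the learned parameters $W_K$ and $W_Q$ (random here, trained in practice), and the dimensions are illustrative assumptions; the point is that the output size is fixed however many hosts have been discovered.

```python
import numpy as np

rng = np.random.default_rng(0)
IN_DIM, ATT_DIM = 8, 16
W_K = rng.normal(size=(IN_DIM, ATT_DIM))  # key projection (learned in practice)
W_Q = rng.normal(size=(IN_DIM, ATT_DIM))  # query projection (learned in practice)

def attention_pool(x):
    """Pool a variable number of host-connection vectors into one fixed-size
    vector: softmax-weighted sum, with keys W_K x_i scored against a query
    derived here from the mean connection vector."""
    keys = x @ W_K                     # (n_hosts, ATT_DIM)
    query = x.mean(axis=0) @ W_Q       # (ATT_DIM,)
    scores = keys @ query              # (n_hosts,) raw attention scores
    scores -= scores.max()             # numerical stability for softmax
    alpha = np.exp(scores) / np.exp(scores).sum()
    return (alpha[:, None] * x).sum(axis=0)   # (IN_DIM,) fixed-size output

# The pooled representation has the same shape for 5 or 9 discovered hosts:
out5 = attention_pool(rng.normal(size=(5, IN_DIM)))
out9 = attention_pool(rng.normal(size=(9, IN_DIM)))
```

This is what lets the downstream fully connected layers keep a constant input width while the agent's view of the network keeps growing.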

Algorithm 1. Training algorithm of rainbow DQN.

Input: Initial network parameters θ, replay buffer B, batch size m, learning rate α, discount factor γ, update frequency F, and τ for soft update.

Output: Trained network Q(s, a; θ).

1: Initialize θ with random weights.

2: Initialize noise parameters μ and σ for each weight (mean and standard deviation of the weights).

3: repeat

4:   Capture state s and connection information from environment.

5:   Construct connection-state matrix Z.

6:   Sample from Z.

7:   Fuse features into f and input into training network Q(s, a; θ).

8:   Apply noisy networks to weights: w = μ + σ ⊙ ε.

9:   Compute value V(s), advantage A(s, a), and Q-values using dueling architecture.

10:   Apply double Q-learning correction: y = r + γQ(s′, argmax_a′ Q(s′, a′; θ); θ⁻).

11:   Select action a* = argmax_a Q(s, a; θ).

12:   Store (s, a, r, s′) in buffer B.

The right-hand side of Fig 3 shows the implementation framework for Rainbow DQN, while Algorithm 1 details the specific training process. Initially, state information s is captured from the environment, and a connection-state matrix is constructed. Then, integrated with an attention mechanism, the fused state features are extracted and fed into the training dueling network. Next, noisy networks are applied to obtain the weights of each layer’s connections in the neural network, and the dueling architecture is used to calculate the value stream V(s) and the advantage stream A(s, a). The double Q-learning technique is then applied to correct the Q-value, and the action a* that maximizes the Q-value is selected and applied to the target environment. The tuple (s, a, r, s′) is stored in the buffer B. This process repeats until the buffer meets the training requirements. During the training phase, samples are drawn with Prioritized Experience Replay (PER), the target Q-values are calculated, the loss is computed using cross-entropy, and the parameters are updated via gradient descent.
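Two of the core computations in this loop, the dueling aggregation and the double Q-learning target, can be sketched compactly. This is an illustrative reduction of the standard techniques, not the paper's code; the example numbers are arbitrary.

```python
import numpy as np

def dueling_q(value, advantage):
    """Dueling architecture: combine the value stream V(s) and the advantage
    stream A(s, a) into Q-values via Q = V + A - mean(A), which keeps the
    decomposition identifiable."""
    return value + advantage - advantage.mean(axis=-1, keepdims=True)

def double_q_target(r, gamma, q_online_next, q_target_next, done):
    """Double Q-learning target: the online network *selects* the next action,
    the target network *evaluates* it, reducing overestimation bias."""
    a_star = int(np.argmax(q_online_next))           # selection (online net)
    bootstrap = 0.0 if done else gamma * q_target_next[a_star]  # evaluation
    return r + bootstrap

# Q-values for a state with one value estimate and three action advantages:
q = dueling_q(np.array([1.0]), np.array([0.5, -0.5, 0.0]))
# Target for a transition with reward -1 and discount 0.9:
y = double_q_target(r=-1.0, gamma=0.9,
                    q_online_next=np.array([2.0, 3.0, 1.0]),
                    q_target_next=np.array([1.5, 2.5, 0.5]),
                    done=False)
```

In the full Rainbow agent these pieces are combined with noisy layers, PER sampling weights, and a distributional output head, but the selection/evaluation split shown here is what implements line 10 of Algorithm 1.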

Experiment

Experimental setup

In this section, we elaborate on several experimental scenarios to verify the effectiveness of our approach by answering the following research questions (RQs).

RQ1: Can the proposed method reduce the number of iterations for the agent to achieve faster convergence?

RQ2: Can the proposed approach adapt to network scenarios of different scales?

To verify the effectiveness of the proposed method, we established a penetration testing environment for distribution automation systems as shown in Fig 4 using GNS3 (version 2.2.52) and VMware Workstation Pro 17 on a Windows 10 platform. The setup includes one main station and two substations, comprising a total of ten hosts, three subnet switches, and one core switch. Each station contains a sensitive host, which is the target of the penetration test (Operator: M0.1, FTU: M1.2 and M2.2). The configuration information of the hosts in the target environment is summarized in Table 2, where the value of sensitive hosts is set to 100, while non-sensitive hosts have a value of 0. The four hosts in the Main Station run the Windows operating system with services such as HTTP, SSH, FTP, and SAMBA configured; the hosts in the Sub-Stations use Linux OS with services like FTP and SSH. Switches are loaded directly into GNS3 using Cisco images. The attacking host, which serves as the penetration testing host, connects via network bridging to the physical Ethernet interface of the Main Station, aiming to discover the three sensitive hosts within the network and complete the penetration test.

thumbnail
Table 2. Configuration information in the experimental platform.

https://doi.org/10.1371/journal.pone.0323357.t002

Services and processes frequently exploited by attackers are selected to represent vulnerabilities. For instance, an agent can exploit FTP vulnerabilities to gain Root access on a host or leverage Tomcat vulnerabilities to escalate privileges from User level on a compromised host. Table 3 lists the actions available to the agent within this environment, along with the cost of execution, success probability, and the permission that can be obtained upon successful execution. The action list includes three scanning actions, four service-specific vulnerability exploitation actions, and three process-specific privilege escalation actions. The success probability for vulnerability exploitation actions is set according to CVSS high, medium, and low difficulty levels at 0.2, 0.5, and 0.8 respectively, while the success probability for other actions is set to 1.
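The stochastic action model in Table 3 can be sketched as follows. The action fields (`type`, `difficulty`, `cost`, `value`) are hypothetical names for illustration; only the success probabilities (0.2/0.5/0.8 by CVSS difficulty, 1.0 for scans) come from the setup described above.

```python
import random

# Success probabilities per CVSS exploitation difficulty, as in the setup:
SUCCESS_PROB = {"high": 0.2, "medium": 0.5, "low": 0.8}

def execute_action(action, rng=random):
    """Simulate one agent action: scanning actions always succeed, while
    exploitation actions succeed with the CVSS-derived probability.
    The reward charges the execution cost and pays the host value on success."""
    if action["type"] == "scan":
        p = 1.0
    else:
        p = SUCCESS_PROB[action["difficulty"]]
    success = rng.random() < p
    reward = -action["cost"] + (action["value"] if success else 0)
    return success, reward

# A scan always succeeds and simply costs its execution price:
ok, r = execute_action({"type": "scan", "cost": 1, "value": 0})
```

Under this model a high-difficulty exploit against a sensitive host (value 100) is a high-variance action: it usually just pays its cost, but occasionally yields a large positive reward.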

The penetration testing agent was developed in a Linux environment using Python 3.9.20, with PyTorch serving as the framework for algorithm implementation. The hardware configuration includes an Intel Core Ultra 7 165H CPU and an NVIDIA A2000 Ada GPU. The training parameters for the model are listed in Table 4.

Additionally, to validate the scalability of the method, we constructed test environments with different network scales. Specifically, we built test scenarios with 50 and 100 hosts, named Scenario 2 and Scenario 3, respectively. The number of sensitive hosts remains unchanged at three. Therefore, as the network scale expands, the rewards become sparser.

Result discussion

(1) Performance comparison under different algorithms (for RQ1)

The learning objective for the Agent is to acquire access to all sensitive hosts within the target environment using fewer steps, thereby obtaining a reward value. Thus, the reward value can be used as a measure of the agent’s policy performance. Fig 5 illustrates the training performance under different learning algorithms, where Fig 5(a) shows the change in cumulative reward per episode over the number of training episodes, and Fig 5(b) shows the change in the number of steps per episode over the training steps. From Fig 5(a), it can be observed that during the initial training phase, the rewards obtained by the agent per episode are relatively small, but these increase as training progresses.

thumbnail
Fig 5. Comparison of training results of different deep learning algorithms (Scenario 1).

(a) Cumulative rewards during the training process, (b) Mean episode steps during the training process.

https://doi.org/10.1371/journal.pone.0323357.g005

Comparing the training processes of the different algorithms, it can be seen that during the initial exploration phase, Noisy Network, Double DQN, Dueling DQN, etc., maintain a relatively low negative reward, with rewards per episode ranging between –25 and –50, but require approximately 800–900 episodes to converge. As training continues, the differences in convergence speed become apparent. The proposed algorithm converges to the maximum cumulative reward in approximately 350 episodes, while the other algorithms converge only after more than 500 episodes, with the slowest taking up to 1300 episodes. Regarding the number of steps per training episode, the results in Fig 5(b) show that the proposed method converges faster and more stably than the other algorithms, achieving convergence nearly 500 episodes earlier.

From the experimental results, it can be seen that the learning curve of the proposed method rises rapidly and quickly approaches the optimal value, maintaining a high level of average cumulative reward throughout the training process. This indicates that the method can effectively utilize training data to accelerate convergence speed. By integrating various DQN improvement techniques, it contributes to a more stable learning process, effectively reducing estimation errors and variances, thereby enhancing the robustness of the policy. The main reason lies in that these improvement techniques address certain limitations of DQN to some extent; for instance, Multi-step Learning and Distributional RL primarily focus on multi-step returns and reward distributions, providing richer learning information, thus their learning curves rise faster. Dueling DQN and Double DQN mainly concentrate on evaluating action values and reducing overestimation, hence they achieve relatively higher learning rewards. The experimental curves of these methods in Fig 5(a) also verify these viewpoints. Therefore, after integrating different DQN improvement techniques, the proposed method demonstrates good performance.

During the aforementioned experiment, we recorded the proportion of accesses to the target host by the proposed method, with the results shown in Fig 6. As can be seen from Fig 6, in the early stages of the experiment, the agent accessed the target hosts with a probability of less than 20%, indicating that the agent spent considerable time attempting accesses on other hosts. However, after fewer than 500 episodes, the proposed method could identify the target host more accurately for penetration testing, increasing the proportion of accesses to the target hosts to between 30% and 40%.

thumbnail
Fig 6. Number of accesses to the target hosts per episode and the proportion of accesses to the target hosts.

https://doi.org/10.1371/journal.pone.0323357.g006

Fig 7 presents the impact of hyper-parameters on model performance, with Fig 7(a) showing results under different discount factors and Fig 7(b) under different learning rates. We evaluated performance using the mean and variance of rewards over the first 100 episodes. The experimental results indicate that while the mean reward varies little across hyper-parameters, staying within a range of –34 to –26, the variance of the rewards differs more noticeably. Specifically, in Fig 7(a), with a learning rate of 0.0025 the model performs best at a discount factor of 0.9 (with an overall smaller variance), especially with a batch size of 256, which yields a relatively stable variance curve. In Fig 7(b), with a discount factor of 0.9, the model shows its best performance.

thumbnail
Fig 7. Effect of hyper-parameters on mean and variance of rewards in the first 100 episodes.

(a) Results on different discount factors and batch sizes, (b) Results on different learning rates and batch sizes.

https://doi.org/10.1371/journal.pone.0323357.g007

(2) Performance comparison in different network scales (for RQ2)

Fig 8 presents the training curves of several DQN algorithms under two experimental scenarios: one with 50 hosts and another with 100 hosts. It can be observed that as the scale of hosts in the network increases, more episodes are required for various algorithms to converge. Especially when the number of hosts in the network increases to 100, due to the rewards becoming increasingly sparse (the proportion of sensitive hosts is only 2%), the performance differences exhibited by different models become more pronounced. Under the experimental scenarios constructed in this study, the performance of Noisy Network is the worst, followed by Distributional RL; whereas Double DQN and Dueling DQN perform relatively better. The proposed method in this paper converges around episode 500 in scenario 2 and around episode 600 in scenario 3, showing a faster convergence rate compared to other algorithms.

thumbnail
Fig 8. Comparison of training results under different network scales.

(a) Cumulative rewards (Scenario 2), (b) Cumulative rewards (Scenario 3).

https://doi.org/10.1371/journal.pone.0323357.g008

Additionally, we further quantified the performance of the aforementioned methods under different network scales using three metrics: average reward over the first 10 episodes, average steps over the first 10 episodes, and the episode number at which the cumulative reward first becomes non-negative. The results are shown in Table 5. The proposed method shows a significant improvement over the other methods in the episode number at which the cumulative reward first becomes non-negative. In scenario 1, the proposed method converges at the episode. In contrast, the fastest of the other methods requires 550 episodes (the method based on Noisy Network), and the worst performer, the method based on Double DQN, requires 1200 episodes. When the network scale expands to 50 and 100 nodes, the proposed method converges at the and episodes, respectively. These results also indicate that reinforcement learning-based methods have a degree of scalability and can adapt to larger network testing scenarios. Moreover, after integrating various improved strategies and prior knowledge, deep reinforcement learning-based methods can complete testing tasks more quickly. However, the proposed method’s performance in the initial stage is somewhat inferior to that of the other methods. For example, in scenario 1, its average cumulative reward over the first 10 episodes is only −237.70, whereas Dueling DQN achieves −28.30, and it requires 6672 steps in the first 10 episodes, while PER needs only 1993. This can be understood, to some extent, as the proposed method focusing on exploring new knowledge rather than immediate rewards in the initial stage. This builds a comprehensive understanding of the system beforehand and improves subsequent execution efficiency, enabling the method to complete the testing work more quickly overall.
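The convergence metric used in Table 5 can be computed with a simple helper. This is an illustrative sketch of the metric's definition as stated above, not the paper's evaluation code.

```python
def first_nonnegative_episode(episode_rewards):
    """Return the 1-based episode number at which the per-episode cumulative
    reward first becomes non-negative, or None if it never does."""
    for episode, reward in enumerate(episode_rewards, start=1):
        if reward >= 0:
            return episode
    return None

# A training run whose reward turns non-negative in episode 4:
ep = first_nonnegative_episode([-30.0, -12.0, -3.0, 4.0, 10.0])  # -> 4
```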

Conclusion

Penetration testing, as a critical approach to information security protection, enhances the cybersecurity defense capabilities of power IoT systems. In this paper, we introduce an automated penetration testing method based on deep RL technology. This work consists of two parts: a state space construction method that integrates prior knowledge such as system and security knowledge, and a training method based on attention mechanisms and Rainbow DQN. Through the establishment of a simulated testing environment, the effectiveness of the proposed method was verified. Experimental results indicate that the proposed method significantly improves the learning efficiency of the agent during the training process and can adapt to networks of varying scales.

However, limitations remain in this work. For instance, only a limited number of vulnerabilities were injected into the target environment, which may differ from real-world scenarios. Therefore, we plan to continuously update datasets to cover more types of attack methods, and enhance the adaptability of the proposed method in complex scenarios by constructing more sophisticated adversarial simulation environments and participating in cybersecurity competitions. Additionally, although experiments considered networks of different sizes, it remains to be confirmed whether the proposed method applies to realistic or larger scale environments. Integrating the module libraries provided by traditional tools as the foundation for the action space of deep learning agents, and deploying a hierarchical structure of agents is expected to enhance testing efficiency and accuracy. Looking forward, we will continue to explore this field and strive to apply our findings to the real world.

References

  1. Guo P, Xiao K, Wang X, Li D. Multi-source heterogeneous data access management framework and key technologies for electric power Internet of Things. Global Energy Interconnection. 2024;7(1):94–105.
  2. Li J, Herdem MS, Nathwani J, Wen JZ. Methods and applications for artificial intelligence, big data, internet of things, and blockchain in smart energy management. Energy AI. 2023;11:100208.
  3. Dong D, Feng H. Design and use of a wireless temperature measurement network system integrating artificial intelligence and blockchain in electrical power engineering. PLoS One. 2024;19(1):e0296398. pmid:38165871
  4. Zhou K, Gao F, Hou Z, Meng X. Power conversion internet of things: architecture, key technologies and prospects. IEEE Trans Ind Inf. 2024;20(8):10587–98.
  5. Wu K, Cheng R, Xu H, Tong J. Design and implementation of the zero trust model in the power internet of things. Int Trans Electric Energy Syst. 2023;2023:1–13.
  6. Nidhi RK, Pradish M, Suneetha MN. Cyber security analysis of a power distribution system using vulnerability assessment and penetration testing tools. 2024. pp. 17–25. https://doi.org/10.33686/pwj.v20i1.1163
  7. Manzoor J, Waleed A, Jamali AF, Masood A. Cybersecurity on a budget: Evaluating security and performance of open-source SIEM solutions for SMEs. PLoS One. 2024;19(3):e0301183. pmid:38547149
  8. MarketsandMarkets. Penetration Testing Market - Global Forecast to 2027. 2025. Available from: https://www.marketsandmarkets.com/Market-Reports/penetration-testing-market-13422019.html
  9. Banik S, Rogers M, Mahajan SM, Emeghara CM, Banik T, Craven R. Survey on vulnerability testing in the smart grid. IEEE Access. 2024;12:119146–73.
  10. Haq IU, Khan TA. Penetration frameworks and development issues in secure mobile application development: a systematic literature review. IEEE Access. 2021;9:87806–25.
  11. Gandikota PSSK, Valluri D, Mundru SB, Yanala GK, Sushaini S. Web application security through comprehensive vulnerability assessment. Procedia Comput Sci. 2023;230:168–82.
  12. Duppa GIP, Surantha N. Evaluation of network security based on next generation intrusion prevention system. TELKOMNIKA. 2019;17(1):39.
  13. Stepanova T, Pechenkin A, Lavrova D. Ontology-based big data approach to automated penetration testing of large-scale heterogeneous systems. In: Proceedings of the 8th International Conference on Security of Information and Networks. 2015. pp. 142–9. https://doi.org/10.1145/2799979.2799995
  14. Zhou Y, Tang Y, Yi M, Xi C, Lu H. CTI view: APT threat intelligence analysis system. Secur Commun Netw. 2022;2022:1–15.
  15. Leichtnam L, Totel E, Prigent N, Mé L. Sec2graph: network attack detection based on novelty detection on graph structured data. In: Detection of Intrusions and Malware, and Vulnerability Assessment: 17th International Conference, DIMVA 2020, Lisbon, Portugal, June 24–26, 2020, Proceedings 17. Springer; 2020. pp. 238–58.
  16. Chen Z, Kang F, Xiong X, Shu H. A survey on penetration path planning in automated penetration testing. Appl Sci. 2024;14(18):8355.
  17. Sarraute C, Buffet O, Hoffmann J. POMDPs make better hackers: accounting for uncertainty in penetration testing. AAAI. 2021;26(1):1816–24.
  18. Shmaryahu D, Shani G, Hoffmann J, Steinmetz M. Partially observable contingent planning for penetration testing. In: IWAISe: First International Workshop on Artificial Intelligence in Security. vol. 33; 2017.
  19. Schwartz J, Kurniawati H, El-Mahassni E. POMDP + information-decay: incorporating defender’s behaviour in autonomous penetration testing. ICAPS. 2020;30:235–43.
  20. Yousefi M, Mtetwa N, Zhang Y, Tianfield H. A reinforcement learning approach for attack graph analysis. In: 2018 17th IEEE International Conference on Trust, Security and Privacy in Computing and Communications/12th IEEE International Conference on Big Data Science and Engineering (TrustCom/BigDataSE). 2018. pp. 212–7. https://doi.org/10.1109/trustcom/bigdatase.2018.00041
  21. Ou X, Govindavajhala S, Appel AW, et al. MulVAL: a logic-based network security analyzer. In: USENIX Security Symposium. vol. 8: Baltimore, MD; 2005. pp. 113–28.
  22. Zennaro FM, Erdődi L. Modelling penetration testing with reinforcement learning using capture-the-flag challenges: trade-offs between model-free learning and a priori knowledge. IET Inf Secur. 2023;17(3):441–57.
  23. Zhou T, Zang Y, Zhu J, Wang Q. NIG-AP: a new method for automated penetration testing. Front Inf Technol Electron Eng. 2019;20(9):1277–88.
  24. Hu Z, Beuran R, Tan Y. Automated penetration testing using deep reinforcement learning. In: 2020 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW). 2020. pp. 2–10. https://doi.org/10.1109/eurospw51379.2020.00010
  25. Nguyen HV, Nguyen HN, Uehara T. Multiple level action embedding for penetration testing. In: Proceedings of the 4th International Conference on Future Networks and Distributed Systems; 2020. pp. 1–9.
  26. Sultana M, Taylor A, Li L. Autonomous network cyber offence strategy through deep reinforcement learning. In: Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications III. vol. 11746. SPIE; 2021. pp. 490–502.
  27. Zhou S, Liu J, Hou D, Zhong X, Zhang Y. Autonomous penetration testing based on improved deep Q-network. Appl Sci. 2021;11(19):8823.
  28. Li Q, Zhang M, Shen Y, Wang R, Hu M, Li Y, et al. A hierarchical deep reinforcement learning model with expert prior knowledge for intelligent penetration testing. Comput Secur. 2023;132:103358.
  29. Sychugov A, Grekov M. Automated penetration testing based on adversarial inverse reinforcement learning. In: 2024 International Russian Smart Industry Conference (SmartIndustryCon). IEEE; 2024. pp. 373–7.
  30. Wang Y, Li Y, Xiong X, Zhang J, Yao Q, Shen C. DQfD-AIPT: an intelligent penetration testing framework incorporating expert demonstration data. Secur Commun Netw. 2023;2023(1):5834434.
  31. Kaelbling LP, Littman ML, Moore AW. Reinforcement learning: a survey. JAIR. 1996;4:237–85.
  32. Watkins CJCH, Dayan P. Q-learning. Mach Learn. 1992;8(3/4):279–92. https://doi.org/10.1023/a:1022676722315
  33. Hessel M, Modayil J, Van Hasselt H, Schaul T, Ostrovski G, Dabney W, et al. Rainbow: combining improvements in deep reinforcement learning. AAAI. 2018;32(1).
  34. Van Hasselt H, Guez A, Silver D. Deep reinforcement learning with double Q-learning. AAAI. 2016;30(1).
  35. Wang Z, Schaul T, Hessel M, Hasselt H, Lanctot M, Freitas N. Dueling network architectures for deep reinforcement learning. In: International Conference on Machine Learning. PMLR; 2016. pp. 1995–2003.
  36. Schaul T. Prioritized experience replay. arXiv preprint 2015. https://arxiv.org/abs/1511.05952
  37. Sutton RS. Learning to predict by the methods of temporal differences. Mach Learn. 1988;3(1):9–44.
  38. Plappert M, Houthooft R, Dhariwal P, Sidor S, Chen RY, Chen X, et al. Parameter space noise for exploration. arXiv preprint 2017. https://arxiv.org/abs/1706.01905
  39. Bellemare MG, Dabney W, Munos R. A distributional perspective on reinforcement learning. In: International Conference on Machine Learning. PMLR; 2017. pp. 449–58.
  40. Li Q, Wang R, Li D, Shi F, Zhang M, Chattopadhyay A, et al. DynPen: automated penetration testing in dynamic network scenarios using deep reinforcement learning. IEEE Trans Inform Forens Secur. 2024;19:8966–81.
  41. Howland H. CVSS: ubiquitous and broken. Digital Threats. 2022;4(1):1–12.
  42. Si R, Chen S, Zhang J, Xu J, Zhang L. A multi-agent reinforcement learning method for distribution system restoration considering dynamic network reconfiguration. Appl Energy. 2024;372:123625.