Continuous patrolling in uncertain environment with the UAV swarm

Research on unmanned aerial vehicle (UAV) swarms has developed rapidly in recent years, and sensor-equipped UAV swarms in particular are becoming a common means of achieving situational awareness. Because UAV swarms with complex control structures have received little attention so far, we propose a patrolling task planning algorithm for a UAV swarm with a double-layer centralized control structure in an uncertain and dynamic environment. The main objective of the UAV swarm is to collect as much environment information as possible. To summarize, the primary contributions of this paper are as follows. We first define the patrolling problem. The patrolling problem is then modeled as a Partially Observable Markov Decision Process (POMDP). Building upon this, we put forward a myopic and scalable online task planning algorithm. The algorithm combines an online heuristic function, a sequential allocation method, and a mechanism of bottom-up information flow and top-down command flow, which effectively reduces the computational complexity. Moreover, as the number of control layers increases, the algorithm guarantees its performance without increasing the computational complexity for the swarm leader. Finally, we empirically evaluate our algorithm in specific scenarios.


Introduction
UAVs have developed rapidly in recent years [1,2], with applications such as agricultural plant protection, pipeline inspection, fire surveillance and military reconnaissance. In August 2016, Vijay Kumar put forward the "5s" development trend of UAVs: small, safe, smart, speed and swarm. In particular, swarm intelligence [3,4] is the core technology of the UAV swarm and is attracting more and more researchers. The study of swarms began with the study of insect community behavior by Grasse in 1953 [5]. For example, the behavior of a single ant is quite simple, but an ant colony composed of these simple individuals shows a highly structured social organization and can accomplish complex tasks far beyond an individual's ability.
The UAV swarm here is a large-scale multi-agent system [6] with complex relationships. Complex relationships can generate complex behaviors, adapting to complex environments and accomplishing complex missions. Compared to the small-scale multi-UAV system, the UAV swarm holds many new advantages, such as lower cost, higher decentralization, higher ... time as possible based on our algorithm. Additionally, our algorithm has practical significance.
The paper is organized as follows. In section 1, we introduce the background of our research. In section 2, the related literature is reviewed. In section 3, we formally define the UAV swarm patrolling problem. Given this, the UAV swarm patrolling problem is formulated as a POMDP in section 4, where the patrolling algorithm is provided to calculate policies for every sub-swarm leader and information gathering UAV. After that, in section 5 we give a proof on the decision-making mechanism and corollaries about scalability and the performance bound. In section 6, our algorithm is evaluated empirically through simulation experiments, by comparison with benchmark algorithms in the same problem setting. Finally, we conclude and point out further research work in section 7.

Related work
In this section, we review related work on dynamic environment models, command and control structures, and approaches to the patrolling task planning problem.
Generally speaking, approaches to gathering situational awareness without considering threats are categorized as the information gathering problem, where agents aim to continuously collect and provide up-to-date situational awareness. One of the challenges in this type of problem is to predict the information at other coordinates in the environment from limited data. As for the environment model, Gaussian Processes [15] have often been used in recent years, as they effectively describe the space-time relationships of the environment. A topology graph is a more abstract way to model the environment. Compared to a topology graph, a Gaussian Process displays more details about the environment. However, a topology graph abstracts the core elements of the environment, which helps to concentrate on the research object. As for environmental dynamics, most environment models in previous work are static [16]. Presently, Markov chains are widely used to model non-static random environment objects, such as the physical behavior of a wireless network [17], the storage capacity of a communication system channel [18] and the communication channel sensing problem [19]. In these papers, different assumptions are added to the Markov model. For the patrolling problem with the UAV swarm, the Markov chain is one of the most popular models. For instance, the ground target is modeled as an independent two-state Markov chain in [20]. Paper [21] models the patrolling environment, with threat state and information state, as a K-state Markov chain. Paper [22] uses a Markov chain to represent the hidden movement of targets. In this paper, we assume the UAV swarm patrolling environment is a topology graph whose state changes according to a K-state Markov chain.
Due to the large number of UAVs, the command and control structure of the UAV swarm should be taken into consideration. There are many control structures for the inner-loop controller of a UAV [23,24], which are different from our research. What concerns us is the relationship among UAVs in the swarm. Generally speaking, control structures of the swarm can be divided into general structures and computable structures. General structures are coarse-grained and can be applied to a variety of fields. In general structures, the research object is described qualitatively, without quantitative analysis. The AIR [25] model divides control structures into four basic patterns: the directed control structure, the acknowledged control structure, the virtual control structure and the collaborative control structure. The 4D/RCS [26] model provides a theoretical basis for unmanned ground vehicles on how their software components should be identified and organized. The 4D/RCS is a hierarchical deliberative architecture that plans up to the subsystem level to compute plans for an autonomous vehicle driving over rough terrain. Paper [27] proposes a scalable and flexible architecture of real-time mission planning and dynamic agent-to-task assignment for the UAV swarm. Compared to general control structures, computable control structures are fine-grained and quantitative. For example, for centralized and decentralized control structures, paper [28] introduces three methods to solve cooperative task planning for the multi-UAV system. Paper [29] proposes a task planning method for a single-layer centralized control structure in a dynamic and uncertain environment. However, most computable control structures at present are single-layer. Thus, in order to effectively manage large-scale UAV swarms, computable control structures with complex relationships should be taken into consideration.
There are many approaches to solving the task planning problem [30], such as mathematical programming, the Markov decision process (MDP) and game theory [28]. For the continuous information gathering problem, MDP-based algorithms are more appropriate because they plan over multiple steps. For instance, in fully observable environments, paper [31] proposes an MDP-based algorithm to compute policies for all the UAVs. Moreover, the POMDP and the Decentralized POMDP (Dec-POMDP) [32] are widely used for partially observable environments. However, most research on the patrolling problem of the UAV swarm considers single-layer control structures, and our work in this paper extends it to a double-layer control structure. Due to the exponential growth in the number of possible courses of action of the UAVs, solving this formulation with current POMDP solvers [33] is hard. Partially Observable Monte Carlo Planning (POMCP) [34] extends some benchmark algorithms to solve multi-agent POMDPs. POMCP breaks the curse of dimensionality and the curse of history, providing a computationally efficient best-first search that focuses its samples on the most promising regions of the search space. However, for the large-scale multi-agent patrolling problem, the state space is still too large to apply POMCP to the multi-agent POMDP directly.

The UAV swarm patrolling problem
In this section we present a general formalization of the patrolling problem for a UAV swarm with a double-layer centralized control structure. We introduce the problem in three parts: an overview of the patrolling problem, the physical environment, and the patrolling UAVs.

Overview of the patrolling problem
The environment is modeled as an upper-layer environment and a lower-layer environment for different decision makers; both correspond to the same real environment. Likewise, the control structure is divided into an upper-layer control structure and a lower-layer control structure. There are three types of UAVs: the swarm leader, the sub-swarm leader and the information gathering UAV (I-UAV for short). The swarm leader and the sub-swarm leaders are decision makers. The upper-layer environment, the lower-layer environment and the three types of UAVs are shown in Fig 1. The lower-layer environment provides information for the sub-swarm leaders, and I-UAVs follow the sub-swarm leaders' commands. The sub-swarm leaders in turn provide information for the swarm leader and follow the swarm leader's commands. The two layers differ in the granularity of time, layout graph, action, and information belief.
The swarm leader is represented by a blue hexagon. There are several sub-swarms in a swarm, and every sub-swarm contains several I-UAVs. The leader of a UAV swarm is called the swarm leader, the leader of a sub-swarm is called the sub-swarm leader, and UAVs that are directly subordinate to a sub-swarm leader are called I-UAVs. In reality, the swarm leader may be a highly intelligent UAV in the swarm, a ground control station, or an early warning aircraft. The main function of the swarm leader is to allocate courses of action to each sub-swarm leader. Sub-swarm leaders are represented by yellow five-pointed stars. They are actors in the upper-layer environment and decision makers in the lower-layer environment, where each leads a sub-swarm and allocates courses of action to its I-UAVs. I-UAVs are represented by red rhombuses and are directly controlled by their superior sub-swarm leader. The function of an I-UAV is to collect environmental information. Additionally, the upper-layer and lower-layer control structures are both centralized, and there are no interactions between UAVs at the same level. Here, let l denote the lower-layer environment, h the upper-layer environment, u the sub-swarm leader, and w the I-UAV. The meanings of the symbols used in this paper are shown in Table 1.

The physical environment
The physical environment is defined by its spatial-temporal and dynamic properties, encoded by the lower-layer environment and upper-layer environment based on the control structure, specifying how and where UAVs can move. In practice, the physical environment is an area of interest such as a mountain forest, a battlefield, or a farmland, where continuous intelligence information is urgently needed. Each vertex in the undirected graph refers to an area in reality, and an edge indicates that two vertices are connected.
Definition 1 (Layout graph) The layout graph is an undirected graph $G = (V, E)$ that represents the layout of the physical environment, where the set of spatial coordinates $V$ is embedded in Euclidean space, and the set of edges $E$ denotes the movements that are possible.
Our model contains an upper-layer layout graph and a lower-layer layout graph, denoted as $G^h$ and $G^l$ respectively. The two layout graphs correspond to the same physical environment, so there is a correspondence between $G^h$ and $G^l$.
Definition 2 (Information level) The information level qualitatively represents the content of the information of interest, denoted as $I_k \in \{I_1, I_2, \dots, I_K\}$, where $K$ is the number of levels. The information level vector is denoted as $I = [I_1, I_2, \dots, I_K]$.
Each vertex has a certain information level at any time. We regard the physical environment as dynamic and partially observable, so the information level of each vertex changes with time. Specifically, an I-UAV can only access its current location in $G^l$ and gather the information there. When an I-UAV visits a vertex, the information level of this vertex is reset to $I_1$. In other words, there is no new information immediately after the most recent visit.

Definition 3 (Information value) The information value is a quantification of the information level, denoted as $f(I_k)$, $I_k \in \{I_1, I_2, \dots, I_K\}$. The function $f: I_k \to \mathbb{R}^+$ maps an information level to an information value. The information value vector is denoted as $F = [f(I_1), f(I_2), \dots, f(I_K)]$.
In order to reduce the decision complexity of the swarm leader, the significant information of interest is extracted from the lower-layer layout graph. Moreover, the information value of vertices that UAVs have not visited for some time may increase. Thus, we take the function $f(\cdot)$ to be monotonically increasing.

Table 1. A summary of the notation used throughout this paper.
G: An undirected graph encoding the physical layout of the environment (Definition 1)
I: The information level of each node (Definition 2)
f(I): The information value of each node (Definition 3)
γ: The discounting factor
C_k: The policy set of k allocated agents in the sequential allocation method

The information value transition matrix $P$ is as follows:
$$P = \begin{bmatrix} p_{11} & \cdots & p_{1K} \\ \vdots & \ddots & \vdots \\ p_{K1} & \cdots & p_{KK} \end{bmatrix} \quad (1)$$
where $p_{ij}$ is the probability that the information level changes from $I_i$ to $I_j$ in one time step.
Assumption 1 The change of information value obeys an independent, discrete-time, multi-state Markov chain according to Eq 1.
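As an illustration of Assumption 1, the following minimal sketch simulates one time step of the information levels. The 3-level matrix `P`, the helper names `step_level` and `simulate`, and the convention that a visited vertex holds $I_1$ at the next step are all assumptions made for this example, not values or code from the paper.

```python
import random

# Hypothetical 3-level transition matrix (each row sums to 1).
# The numbers are illustrative only and are not taken from the paper.
P = [
    [0.7, 0.2, 0.1],
    [0.0, 0.8, 0.2],
    [0.0, 0.0, 1.0],
]

def step_level(level: int, P: list[list[float]], rng: random.Random) -> int:
    """Sample the next information level (0-based index) from row `level` of P."""
    return rng.choices(range(len(P)), weights=P[level])[0]

def simulate(levels: list[int], visited: set[int],
             P: list[list[float]], rng: random.Random) -> list[int]:
    """Advance every vertex by one time step.

    Assumption: a vertex visited in this step is reset to I_1 (index 0);
    every other vertex evolves independently according to P.
    """
    return [0 if v in visited else step_level(lv, P, rng)
            for v, lv in enumerate(levels)]
```

Calling `simulate([2, 1, 0], {0}, P, random.Random(0))` resets vertex 0 and lets the other two vertices evolve on their own rows of `P`.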
Here, we assume the information value state transition matrix $P$ is known in advance. $P^h$ represents the upper-layer transition matrix, while $P^l$ represents the lower-layer transition matrix. Additionally, the concept of stochastic dominance is widely used in many applications [35], such as economics, finance, and statistics. Specifically, for two $K$-dimensional vectors $x = \{x_1, x_2, \dots, x_K\}$ and $y = \{y_1, y_2, \dots, y_K\}$, stochastic dominance is written $x \succeq y$ and holds if:
$$\sum_{k=j}^{K} x_k \ge \sum_{k=j}^{K} y_k, \quad \forall j \in \{1, 2, \dots, K\}$$

Assumption 2 Information value vector F follows stochastic dominance.
Intuitively, if a vertex v has higher information value than other vertices currently, the vertex v might have higher information value at the next moment.

Assumption 3 Information value transition matrix P is a monotone matrix.
Generally, if no UAVs gather information in an area, the unknown information in this area may increase with time. The monotone matrix [36] $P$ satisfies the following: for two compact information belief vectors (see Eq 13) $b_n$ and $b_{n'}$, if $b_n \succeq b_{n'}$, then $b_n \cdot P \succeq b_{n'} \cdot P$ [17]. That is, if no UAVs visit vertices $v_n$ and $v_{n'}$ at the moment, their information belief vectors maintain stochastic dominance at the next moment. Additionally, if $b_n \succeq b_{n'}$, then $b_n \cdot F \ge b_{n'} \cdot F$, which means that the stochastically dominant belief vector has an expected information value at least as high.
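The two properties above can be checked numerically. The sketch below assumes the usual first-order definition of stochastic dominance (every tail sum of $x$ is at least the corresponding tail sum of $y$); the matrix `P`, the value vector `F`, the two beliefs and all function names are hypothetical examples.

```python
def dominates(x: list[float], y: list[float], tol: float = 1e-12) -> bool:
    """True if x stochastically dominates y (tail sums of x >= tail sums of y)."""
    tail_x = tail_y = 0.0
    for xi, yi in zip(reversed(x), reversed(y)):
        tail_x += xi
        tail_y += yi
        if tail_x < tail_y - tol:
            return False
    return True

def vec_mat(b: list[float], P: list[list[float]]) -> list[float]:
    """Row vector times matrix: (b . P)_j = sum_i b_i * P[i][j]."""
    return [sum(b[i] * P[i][j] for i in range(len(b))) for j in range(len(P[0]))]

def dot(b: list[float], F: list[float]) -> float:
    """Expected information value b . F."""
    return sum(bi * fi for bi, fi in zip(b, F))

# Hypothetical monotone matrix: each row dominates the row above it.
P = [[0.7, 0.2, 0.1], [0.0, 0.8, 0.2], [0.0, 0.0, 1.0]]
F = [0.0, 1.0, 2.0]      # hypothetical increasing information values
b_hi = [0.1, 0.3, 0.6]   # belief concentrated on high levels
b_lo = [0.5, 0.3, 0.2]   # belief concentrated on low levels
```

With these values, `b_hi` dominates `b_lo`, the dominance is preserved after multiplying both by `P`, and `dot(b_hi, F)` is at least `dot(b_lo, F)`.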

Definition 4 (Time) Time is modelled by a discrete set of temporal coordinates $t \in \{0, 1, \dots\}$, henceforth referred to as time steps.
The lower-layer time step is denoted as $t^l$, and the upper-layer time step is denoted as $t^h$. Here, a time step contains one OODA (Observation, Orientation, Decision, Action) loop for all the agents in the same layer, and there is a correspondence between the two kinds of time step.
Definition 5 (Time Step Ratio) Time step ratio is the ratio of the real time of one upper-layer time step to that of one lower-layer time step, denoted as M.
By Definition 5, one upper-layer time step spans $M$ lower-layer time steps.
Definition 6 (Corresponding Relationship of Time) Let the functions $\Theta_t(\cdot)$ and $\Theta_t^{-1}(\cdot)$ denote the corresponding relationship of time.
The corresponding relationship between $t^h$ and $t^l$ is expressed through $\Theta_t$ and $\Theta_t^{-1}$, where Floor denotes rounding a fraction down.
We use the term "region block" to represent a square area in graph $G^l$. Each vertex $v$ in $G^h$ corresponds to a region block. The side length of a region block is $d_r$, so a region block includes $d_r \times d_r$ lower-layer vertices.
Example 2 The lower-layer layout graph $G^l$ includes 300 × 200 vertices, and the region block $G^r$ includes 20 × 20 vertices. Then the upper-layer layout graph $G^h$ is a rectangular area with 15 × 10 vertices. Therefore, the hierarchical environment greatly reduces the decision-making complexity for the swarm leader.
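The correspondence between the two layers can be sketched in a few lines. The forms used here are assumptions consistent with the text rather than the paper's own equations: $\Theta_t$ is taken to be floor division by $M$, its inverse is taken to return the first lower-layer step of an upper-layer step, and vertices are addressed by 0-based grid coordinates. All function names are hypothetical.

```python
def upper_time(t_l: int, M: int) -> int:
    """Assumed form of Theta_t: the upper-layer step that contains lower-layer step t_l."""
    return t_l // M

def lower_time(t_h: int, M: int) -> int:
    """Assumed inverse mapping: the first lower-layer step of upper-layer step t_h."""
    return t_h * M

def region_of(x: int, y: int, d_r: int) -> tuple[int, int]:
    """Upper-layer vertex (region block) that contains lower-layer vertex (x, y)."""
    return (x // d_r, y // d_r)

def upper_shape(width: int, height: int, d_r: int) -> tuple[int, int]:
    """Size of G^h when a width x height graph G^l is tiled by d_r x d_r blocks."""
    return (width // d_r, height // d_r)
```

For the numbers in Example 2, `upper_shape(300, 200, 20)` gives `(15, 10)`, so the swarm leader reasons over 150 vertices instead of 60000.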

The patrolling UAVs
There are three types of UAVs, namely, the swarm leader, the sub-swarm leader, and the information gathering UAV.

Definition 8 (Swarm leader) A swarm leader is an entity capable of making decisions for all the sub-swarm leaders.
The role of the swarm leader is to manage the whole UAV swarm; its function is similar to that of a ground workstation or an early warning aircraft. However, because of the hierarchical control structure, the swarm leader mainly focuses on the state of the sub-swarm leaders and the upper-layer environment. In this paper, we assume that communication between the sub-swarm leaders and the swarm leader is reliable regardless of the distance between them.

Definition 9 (Sub-swarm leader) A sub-swarm leader is a physical mobile entity capable of making decisions for its subordinate UAVs. The sub-swarm leader is denoted as $u$.
The behaviors of a sub-swarm leader can be divided into a decision-making part and an action-executing part. The sub-swarm leader is an actor in $G^h$, following the commands of the swarm leader. However, the sub-swarm leader is a decision maker in $G^l$, controlling several I-UAVs. In short, the sub-swarm leader acts as a bridge connecting the swarm leader and the I-UAVs.
Actions of sub-swarm leaders are atomic in $G^h$. This means that the sub-swarm leader can move from an upper-layer vertex $v^h_i$ to a neighboring vertex $v^h_j \in \mathrm{adj}_{G^h}(v^h_i)$ within one time step $t^h$. Meanwhile, different sub-swarm leaders can visit the same vertex at the same time. The sub-swarm leader performs the same actions in $G^l$ as it performs in $G^h$. Based on formula 5, the time step ratio $M$ is no less than the side length $d_r$ of a region block, which ensures that the sub-swarm leader can reach the target area in time. In this paper, we set $M = d_r$.

Definition 10 (Information Gathering UAV) An information gathering UAV (I-UAV for short) is a physical mobile entity capable of taking observations. The I-UAV is denoted as $w$.
I-UAVs collect environment information by visiting lower-layer vertices. I-UAVs are distributed in the lower-layer layout graph $G^l$, and different I-UAVs can visit the same lower-layer vertex $v^l$ at the same time. The movement of an I-UAV in $G^l$ is atomic. We assume that an I-UAV is fast enough to move from one vertex to an adjacent vertex within one time step. In addition, the cost of I-UAV movement is not taken into account in this paper.
If an I-UAV visits a lower-layer vertex $v^l$, it automatically gathers the current information value of this vertex. After the visit, the information level of vertex $v^l$ is reset to $I_1$, indicating that the information of $v^l$ has been collected and there is currently no new information. Nevertheless, the environment changes dynamically with time according to formula 1. I-UAVs can only access the information at the current moment and cannot observe the state of a vertex at the next moment.

Assumption 4 The communication capacity of I-UAVs is limited. The feasible area of I-UAVs is a square region block centering on the current position of their superior sub-swarm leader.
In other words, the feasible area for I-UAVs moves with the movement of the sub-swarm leader in $G^l$.
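A minimal sketch of Assumption 4 follows. It assumes a rectangular grid with 0-based coordinates, 4-neighbour adjacency, and a $d_r \times d_r$ square whose offset around the leader is `d_r // 2`; these conventions and the function names are hypothetical choices for illustration.

```python
def feasible_area(center: tuple[int, int], d_r: int,
                  width: int, height: int) -> set[tuple[int, int]]:
    """Lower-layer vertices in the d_r x d_r square around `center`, clipped to the grid."""
    cx, cy = center
    half = d_r // 2
    xs = range(max(0, cx - half), min(width, cx - half + d_r))
    ys = range(max(0, cy - half), min(height, cy - half + d_r))
    return {(x, y) for x in xs for y in ys}

def feasible_moves(pos: tuple[int, int],
                   area: set[tuple[int, int]]) -> list[tuple[int, int]]:
    """Stay in place or move to a 4-neighbour, keeping only vertices inside the area."""
    x, y = pos
    candidates = [(x, y), (x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [c for c in candidates if c in area]
```

Recomputing `feasible_area` whenever the sub-swarm leader moves makes the admissible set of I-UAV positions follow the leader.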
Definition 11 (Corresponding Relationship of Action) Let $\Theta_a$ denote the correspondence of sub-swarm leader actions between the upper-layer layout graph and the lower-layer layout graph.
For convenience of description, we define the concept of a team. There are two types of teams in our model: teams of I-UAVs and teams of sub-swarm leaders. The policies of the agents are decided by the team leader. The team leader is the swarm leader in $G^h$, and the sub-swarm leader in $G^l$.

Definition 12 (Team) The team is a multi-agent system with a single-layer centralized control structure.

The UAV swarm myopic patrolling algorithm
In the centralized control structure, the information flow is bottom-up, while the control flow is top-down. In this section, we introduce the UAV swarm patrolling algorithm from the perspective of the control flow. Given the problem described in the previous sections, we first introduce the multi-agent patrolling formulation. Then we introduce the objective of the patrolling problem. After that, we introduce the UAV swarm patrolling algorithm.

Team of agents patrolling model
The swarm patrolling model can be divided into the multiple sub-swarm leaders patrolling model and the multiple I-UAVs patrolling model. Because they have the same control structure and similar environments, the formulations of the two models are similar. Without loss of generality, we take one team as an example. The team leader obtains joint observation values, takes joint actions, and gets joint return values. So the multi-agent patrolling problem can be modeled as an MPOMDP, and the MPOMDP can in turn be regarded as a POMDP, which is denoted as $\langle S, A, O, \Phi, \Omega, R, B \rangle$:
• $S$ is the joint state set of all the agents in the team, including the joint position state set and the joint information state set, denoted as $S = [S^V, S^I]$. A joint position state is denoted as $s^V = [s^V_1, s^V_2, \dots, s^V_U] \in S^V$, and a joint information state is denoted as $s^I = [s^I_1, s^I_2, \dots, s^I_U] \in S^I$.
• $A$ is the joint action set of all the agents in the team. A joint action is denoted as $a = [a_1, a_2, \dots, a_U] \in A$. The team leader determines what actions the agents should perform. Specifically, the action of an agent is either a movement from its current vertex to an adjacent vertex or remaining at its current vertex.
• $O$ is the joint observation set of all the agents in the team. We set $o = s$, which means the observation is equal to the current information state.
• $\Phi$ is the joint state transition function set, including the position state transition function and the information state transition functions, denoted as $\dots; \Phi^I_{|V|}$. For the position transition function, an agent always reaches its target neighbouring vertex; here $s^V_{goal}$ denotes the expected destination. For the information state transition function, $s^I_{goal}$ denotes the expected target state, and the transition is based on Eq 1.
• $\Omega$ is the joint observation function set of all the agents in the team.
• $R$ is the joint reward function set of all the agents in the team. The reward of the swarm is equal to the sum of the rewards of all the sub-swarms. The reward of a sub-swarm is equal to the sum of the rewards of its I-UAVs. The reward of an I-UAV is equal to the information value of the vertex it currently visits:
$$R(t) = f(s^I(t)) \quad (9)$$
• $B$ is the compact information belief vector, which is a compact representation of the standard information belief vector. The standard information belief vector is the posterior probability distribution over the possible information states. The belief is built on Assumption 1, under which the information state of each vertex changes independently. The standard information belief is a sufficient statistic for the design of the optimal policy at any time step [37], and the compact information belief $B$ is an equivalent description of the standard information belief [29]. Without loss of generality, we take vertex $v_n$ as an example. Its belief is
$$b_n(t) = [p^n_{I_1}(t), p^n_{I_2}(t), \dots, p^n_{I_K}(t)] \quad (11)$$
where $p^n_{I_k}(t)$ is the posterior probability of information level $I_k$ at time step $t$, and $\sum_{k=1}^{K} p^n_{I_k}(t) = 1$. The number of information states of the lower-layer environment is thereby reduced to $\sum_{n=1}^{|V|} K_n$, which decreases the computational and memory complexity significantly. In the update function of $b$ (formula 12), $\Lambda$ denotes the unit vector whose first element is 1, $v$ is the vertex visited by an agent, and $v_n$ is a vertex in $G$.
Moreover, let $B^l$ denote the lower-layer compact information belief (L-belief for short) and $B^h$ the upper-layer compact information belief (H-belief for short). The upper-layer information belief derives from the lower-layer information belief; let $\Theta_b(\cdot)$ denote the relationship between H-belief and L-belief (formula 13), where $t^h = \Theta_t(t^l)$. The qualitative criterion for the extraction method is to reduce the computational complexity while retaining sufficient key information. We therefore use an average filter, which is simple yet retains the general information of the lower-layer environment.
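A belief update in the spirit of formula 12 can be sketched as follows. The exact form of the paper's update is not reproduced here; the sketch assumes that a visited vertex is reset to $\Lambda = [1, 0, \dots, 0]$ and that every other vertex is propagated one step by $P$. The function names and the dictionary representation of $B$ are hypothetical.

```python
def update_belief(b: list[float], P: list[list[float]], visited: bool) -> list[float]:
    """One-step update of the compact belief of a single vertex.

    Assumption: a visited vertex is reset to Lambda = [1, 0, ..., 0];
    an unvisited vertex evolves as b <- b . P.
    """
    if visited:
        return [1.0] + [0.0] * (len(b) - 1)
    return [sum(b[i] * P[i][j] for i in range(len(b))) for j in range(len(b))]

def update_all(beliefs: dict, P: list[list[float]], visited_vertices: set) -> dict:
    """Apply the single-vertex update to every vertex of the layout graph."""
    return {v: update_belief(b, P, v in visited_vertices)
            for v, b in beliefs.items()}
```

Because each vertex carries its own length-$K$ vector, memory grows linearly with the number of vertices rather than exponentially with it.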
Specifically, taking an upper-layer vertex $v^h_n$ (corresponding to a region block) as an example, the relationship between H-belief and L-belief is as follows:
$$p^{h,n}_{I_k}(t^h) = \frac{1}{N_r} \sum_{i=1}^{N_r} p^{l,i}_{I_k}(t^l)$$
where $t^h = \Theta_t(t^l)$, and $N_r$ is the number of lower-layer vertices in the region block. $p^{l,i}_{I_k}(t^l)$ represents the probability that the information level of vertex $v^l_i$ is $I_k$ at $t^l$, and $p^{h,n}_{I_k}(t^h)$ is the probability that the information level of $v^h_n$ is $I_k$ at $t^h$.
Example 4 Taking a region block as an example, it corresponds to the upper-layer vertex $v^h_n$. This region block includes four lower-layer vertices, denoted as $\{v^l \dots$
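The average filter can be illustrated for a region block of four lower-layer vertices. The sketch assumes that the filter is the arithmetic mean taken level by level; the four L-beliefs are hypothetical numbers chosen for the example, and `aggregate_belief` is a hypothetical name.

```python
def aggregate_belief(lower_beliefs: list[list[float]]) -> list[float]:
    """H-belief of one region block as the level-wise mean of its L-beliefs."""
    n_r = len(lower_beliefs)        # number of lower-layer vertices in the block
    K = len(lower_beliefs[0])       # number of information levels
    return [sum(b[k] for b in lower_beliefs) / n_r for k in range(K)]

# Hypothetical L-beliefs of the four vertices in one region block (K = 3).
block = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
]
h_belief = aggregate_belief(block)
```

Since every L-belief sums to 1, their mean does too, so the result is again a valid belief vector over the $K$ levels.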

The objective of patrolling problem
Definition 13 (Policy) The policy is a set of courses of action made by the team leader, denoted as $\pi$.
In addition, let $\pi_D$ denote the policy when the horizon of the team leader (the number of time steps that it looks ahead) is $D$. Let $D^h$ denote the horizon of the swarm leader, and $D^l$ the horizon of the sub-swarm leader. The policy for an agent is defined as follows: . . . Moreover, the patrolling objective of the UAV swarm is to acquire the maximum reward, and our algorithm searches for policies that achieve it. In the objective, $R^l_{i,j}(o^l(t^l))$ is the reward of I-UAV $w_{i,j}$ when the observation is $o^l(t^l)$, $U$ is the number of sub-swarms, $W_i$ is the number of I-UAVs in the $i$-th sub-swarm, and $\gamma \in [0, 1]$ is the discount factor.
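The quantity being maximized can be sketched as a discounted sum of I-UAV rewards over sub-swarms and time steps. This assumes the standard discounted form with weight $\gamma^t$ at step $t$; the nested-list layout `rewards_per_step[t][i][j]` and the function name are hypothetical.

```python
def discounted_return(rewards_per_step: list[list[list[float]]], gamma: float) -> float:
    """Discounted swarm reward.

    rewards_per_step[t][i][j] is the reward of I-UAV j in sub-swarm i at
    lower-layer time step t (assumed layout, for illustration only).
    """
    total = 0.0
    for t, step in enumerate(rewards_per_step):
        swarm_reward = sum(sum(sub_swarm) for sub_swarm in step)
        total += gamma ** t * swarm_reward
    return total
```

For example, two sub-swarms collecting `[[1.0, 2.0], [3.0]]` at step 0 and `[[0.0, 1.0], [1.0]]` at step 1 with `gamma = 0.5` yield 6.0 + 0.5 × 2.0 = 7.0.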

The swarm patrolling algorithm
In this section, we introduce the patrolling algorithm. Firstly, we propose the patrolling algorithm for a single agent. After that, the team of agents patrolling algorithm is built on the single agent patrolling algorithm. Finally, we put forward the swarm patrolling algorithm.

Single agent patrolling algorithm (SAPA).
To effectively predict the information value state of the layout graph, we exploit the dynamics of the environment. From formula 1, we know the information value state transition property, so we propose a heuristic function, denoted as $H(t)$, to predict the reward after performing a policy. In the heuristic function, $\bar{b}(t + k)$ is the expected belief of the vertex that may be visited by agents at $t + k$.
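One plausible reading of the heuristic is an expected discounted reward along a candidate path. The sketch below is an assumption-laden illustration, not the paper's formula: it propagates every belief by $P$ before each visit, scores a visit as $\bar{b} \cdot F$, resets the visited vertex, and discounts the $k$-th visit by $\gamma^k$ starting from $k = 0$. The names `vec_mat` and `heuristic` are hypothetical.

```python
def vec_mat(b: list[float], P: list[list[float]]) -> list[float]:
    """Row vector times matrix."""
    return [sum(b[i] * P[i][j] for i in range(len(b))) for j in range(len(P[0]))]

def heuristic(path: list, beliefs: dict, P: list[list[float]],
              F: list[float], gamma: float) -> float:
    """Expected discounted reward of visiting path[0], path[1], ... in turn."""
    beliefs = dict(beliefs)   # work on a copy; the caller's beliefs stay unchanged
    total = 0.0
    for k, v in enumerate(path):
        # Expected beliefs one step later, with no new observation available.
        beliefs = {u: vec_mat(b, P) for u, b in beliefs.items()}
        total += gamma ** k * sum(p * f for p, f in zip(beliefs[v], F))
        beliefs[v] = [1.0] + [0.0] * (len(F) - 1)   # the visit resets the vertex
    return total
```

A single agent can then enumerate its feasible paths of length $D$ and keep the one with the largest score.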
The update function of $\bar{b}(t + k)$ is based on formula 12. However, the information transition matrix $P$ differs between teams of sub-swarm leaders and teams of I-UAVs. If the agent is an I-UAV, the team leader is the sub-swarm leader in $G^l$, and $P^l = P$. If the agent is a sub-swarm leader, the team leader is the swarm leader in $G^h$, and $P^h = P^M$, because the time step ratio is $M$.
Team of agents patrolling algorithm (TAPA).
The team of agents patrolling problem is an MPOMDP, which can be simplified as a POMDP. The joint action space of the POMDP is the Cartesian product of the action spaces of all sub-swarm leaders. Generally, it is hard to solve this formulation due to its huge state space. In order to deal with this problem, the sequential allocation method is used to decrease the state space. With the sequential allocation method, two types of double counting arise: synchronous double counting and asynchronous double counting.
Synchronous double counting occurs when a vertex is visited by different agents at the same time. In this case, the environment information would be counted more than once. We stipulate that the first I-UAV allocated to visit the vertex acquires the information value, while the other I-UAVs visiting the vertex get nothing.
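The rule for synchronous double counting can be written down directly. In this minimal sketch, `targets` lists the vertex chosen by each agent in allocation order and `values` maps each vertex to its current information value; both names and the data layout are hypothetical.

```python
def allocate_rewards(targets: list, values: dict) -> list[float]:
    """Reward per agent when several agents may visit the same vertex at once.

    Only the first agent in allocation order that visits a vertex is rewarded;
    later visitors of the same vertex receive 0.
    """
    claimed = set()
    rewards = []
    for v in targets:
        rewards.append(0.0 if v in claimed else values[v])
        claimed.add(v)
    return rewards
```

With `targets = ["a", "b", "a"]` and `values = {"a": 2.0, "b": 1.0}`, the third agent arrives at an already claimed vertex and is credited nothing.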
Asynchronous double counting occurs when the $j$-th agent ($i < j$) decides to visit vertex $v$ at $t_1$ ($t_1 < t_2$) after the $i$-th agent has already decided to visit this vertex at $t_2$, where $t_1, t_2 \in \{0, 1, \dots, D - 1\}$. In this case, the expected value of vertex $v$ is overestimated, because the $j$-th agent does not take into account that the $i$-th agent has decided to visit the vertex. The penalty factor therefore reduces the expected information value of vertex $v$ for the $j$-th agent.
Definition 14 (Penalty Factor) The penalty factor, denoted as $\bar{p}$, is the difference between the expected reward and the revised expected reward in the case of asynchronous double counting.
The formula is $\bar{p} = r_{expected} - r_{revised}$, where $r_{expected} \in \mathbb{R}^+$ denotes the expected reward of the $i$-th agent without regard to the visit of the $j$-th agent, and $r_{revised} \in \mathbb{R}^+$ denotes the revised expected reward of the $i$-th agent with regard to the visit of the $j$-th agent. The revised expected reward is computed from $\tilde{b}(t_2)$, the revised H-belief or L-belief at $t_2$.
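To make Definition 14 concrete, the sketch below computes a penalty factor under explicit assumptions: the expected reward of the $i$-th agent is the belief at $t_1$ propagated to $t_2$ and multiplied by $F$, while the revised reward starts from a reset belief $\Lambda$ at $t_1$ because the $j$-th agent visits first. These forms and the function names are hypothetical reconstructions, not the paper's equations.

```python
def propagate(b: list[float], P: list[list[float]], steps: int) -> list[float]:
    """Apply b <- b . P the given number of times."""
    for _ in range(steps):
        b = [sum(b[i] * P[i][j] for i in range(len(b))) for j in range(len(b))]
    return b

def penalty_factor(b_t1: list[float], P: list[list[float]], F: list[float],
                   t1: int, t2: int) -> float:
    """Loss in the i-th agent's expected reward at t2 caused by a visit at t1 < t2.

    b_t1 is the expected belief of the vertex at t1, just before the earlier visit.
    """
    def value(b: list[float]) -> float:
        return sum(p * f for p, f in zip(b, F))

    reset = [1.0] + [0.0] * (len(F) - 1)
    expected = value(propagate(b_t1, P, t2 - t1))   # ignoring the visit at t1
    revised = value(propagate(reset, P, t2 - t1))   # vertex reset at t1, then evolves
    return expected - revised
```

When the vertex already holds $I_1$ with certainty at $t_1$, the earlier visit changes nothing and the penalty is zero.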

Definition 15 (Revised Heuristic Function) The revised heuristic function (H-function for short) is the heuristic function with the penalty factor added in, denoted as $\tilde{H}(\cdot)$.
In the H-function, $\bar{p}_{all}$ is the sum of the penalty factors incurred when evaluating a policy. Now we describe the process of the sequential allocation algorithm. Firstly, the allocation sequence of the agents is sorted randomly. Secondly, we calculate the optimal policy of each agent sequentially. When calculating the revised expected reward of the $k$-th agent, its current position $v_k(t)$, the information belief vector $B(t)$ and the already calculated optimal policies $C^*_{k-1}$ are taken into account. The revised expected reward is equal to the revised heuristic function.
The sequential allocation method greedily computes a policy for each single agent in turn, instead of computing a joint policy for the team. The sequential allocation method [31] for multiple agents is defined as follows: . . .
where $\tilde{R}(\cdot)$ is the revised expected reward function, $B(t)$ is the compact belief vector of the vertices at $t$, and $C^*_k$ is the set of computed optimal policies from the 1st agent to the $k$-th agent, denoted as $C^*_k = \{\pi^*_1, \dots, \pi^*_k\}$, $k \in \{0, 1, \dots, K - 1\}$, $C^*_0 = \emptyset$. The procedure of the team of agents patrolling algorithm is given in Algorithm 1:
6: for $\pi \in \Pi_D(t)$ do
7: Calculate $\bar{B}$ and $\tilde{B}$ from $t$ to $t + D - 1$
8: Calculate the revised expected reward $\tilde{R}$ of $\pi$
9: Compare $\pi$ with $\pi^*$, storing the optimal policy
10: end for
11: Store the optimal policy and path in $C$
12: end for
13: Return actions $a(t)$ of all the agents
14: end function
In the beginning, the new information belief vector $B(t)$ is computed based on formula 12 (for the lower-layer vertices) or formula 13 (for the upper-layer vertices). Then the optimal policies of all the agents are calculated sequentially: firstly, all the feasible policies are calculated according to Assumption 4; secondly, the expected belief vector $\bar{B}$ and the revised expected belief vector $\tilde{B}$ are calculated according to Eq 19; thirdly, the revised expected reward is calculated according to Eq 21; fourthly, after comparing the revised expected reward with the stored maximum reward, the optimal policy is updated and stored in $C$.
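The greedy structure of sequential allocation can be sketched as below. This is a simplified stand-in rather than Algorithm 1 itself: instead of computing a heuristic and subtracting penalty factors, it scores each candidate path by its marginal gain in expected team reward given the paths already committed, which discounts both kinds of double counting in a similar spirit. The graph, beliefs and all function names are hypothetical.

```python
def vec_mat(b, P):
    """Row vector times matrix."""
    return [sum(b[i] * P[i][j] for i in range(len(b))) for j in range(len(P[0]))]

def team_reward(paths, beliefs, P, F, gamma):
    """Expected discounted reward of a set of committed paths (one per agent)."""
    beliefs = dict(beliefs)
    total = 0.0
    horizon = max((len(p) for p in paths), default=0)
    for k in range(horizon):
        beliefs = {u: vec_mat(b, P) for u, b in beliefs.items()}
        # A vertex pays out once per step, however many agents visit it.
        for v in {p[k] for p in paths if k < len(p)}:
            total += gamma ** k * sum(q * f for q, f in zip(beliefs[v], F))
            beliefs[v] = [1.0] + [0.0] * (len(F) - 1)   # reset after the visit
    return total

def paths_from(start, adj, D):
    """All length-D paths: at each step stay or move to an adjacent vertex."""
    paths = [[start]]
    for _ in range(D):
        paths = [p + [n] for p in paths for n in [p[-1]] + adj[p[-1]]]
    return [p[1:] for p in paths]   # drop the starting vertex

def sequential_allocation(starts, adj, beliefs, P, F, gamma, D):
    """Greedily fix one agent's path at a time, in allocation order."""
    committed = []
    for start in starts:
        base = team_reward(committed, beliefs, P, F, gamma)
        best = max(
            paths_from(start, adj, D),
            key=lambda p: team_reward(committed + [p], beliefs, P, F, gamma) - base,
        )
        committed.append(best)
    return committed
```

Each agent enumerates only its own paths, so the number of evaluated candidates grows linearly with the number of agents instead of exponentially.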

The UAV swarm patrolling algorithm (USPA).
The information flow and the command flow are the two main interactive processes between different layers. Specifically, the information flow is a bottom-up process and the command flow is a top-down process.
First, I-UAVs visit the lower-layer vertices and collect information value. The sub-swarm leaders calculate the information belief of all their vertices and transfer the lower-layer information belief $B_l(t)$ to the swarm leader. The swarm leader then calculates the upper-layer information belief vector $B_h(t)$. The lower-layer belief vector $B_l(t)$ is updated according to formula 12, and the upper-layer belief vector $B_h(t)$ is calculated according to formula 13. Second, the swarm leader makes decisions for all the sub-swarm leaders, and each sub-swarm leader then makes decisions for its agents. The policies $\pi$ of the agents are calculated with TAPA (see Algorithm 1). The UAV swarm patrolling algorithm is given in Algorithm 2.

Theoretical analysis
In this section, we analyse the performance of SAPA, TAPA, and USPA. Firstly, the performance of SAPA is qualitatively analysed. Then the performance of TAPA is analysed based on theory 1 and corollary 1. After that, we analyse the performance of USPA through corollary 2 and corollary 3.
For single-UAV patrolling, designing the patrolling algorithm of each UAV is an open problem. The policy produced by SAPA may not be optimal. However, it is a myopic policy that exploits the dynamic property of the environment, so it retains heuristic capability. In particular, SAPA is time-saving compared with POMCP [34].
In TAPA, the sequential allocation method is used to calculate policies, instead of computing joint policies. The collected information value is monotonically increasing with diminishing increments [38], so our model still guarantees a lower bound on performance relative to joint policies [31,39]. Here, the joint-policy method calculates the best reward over the Cartesian product of the policies of all the agents.
The accumulated function for the swarm leader is defined as follows:

The accumulated function for the i-th sub-swarm leader is defined as follows:

Theory 1 Let $f: 2^{E} \to \mathbb{R}$ be a non-decreasing submodular set function [31]. The greedy algorithm that iteratively selects the element $e \in E$ with the highest incremental value with respect to the previously chosen elements $I \subseteq E$, until the resulting set $I$ has the desired cardinality $k$, has an approximation bound $\frac{f(I_G)}{f(I^{*})}$ of at least $\mathrm{Bound}(k) = 1 - \left(\frac{k-1}{k}\right)^{k}$, where $I^{*} \subseteq E$ is the optimal subset of cardinality $k$ that maximises $f$.
Proof The functions $Q^{h}(u)$ and $Q^{l}_{i}(w)$ can be treated separately under certain conditions. When only the upper-layer environment is taken into consideration, $Q^{h}(u)$ is an independent function. When the swarm leader has made a decision and it is the sub-swarm leader's turn to decide, $Q^{l}_{i}(w)$ is an independent function. Because both use the same decision-making mechanism, we take $Q^{h}(u)$ as an example without loss of generality. The non-decreasing property reflects the fact that adding more agents never reduces the observation value they receive as a team (since existing agents do not change their policies). To prove submodularity, we show that formula 27 holds for every pair of policy sets $\pi' \subseteq \pi'' \subseteq X$ and every policy $\pi \in X$ with $\pi \notin \pi''$.

Without loss of generality, we take the policy $\pi$ as an example. The right-hand side of the formula is equal to:

while the left-hand side of the formula is equal to:

$$Q(\pi \cup \pi'') - Q(\pi'') = Q(\pi \mid \pi'') + Q(\pi'') - Q(\pi'') = \sum_{i=0}^{D-1} \left( \gamma^{i} \cdot b_{\pi''}(t+i) \cdot F \right) - p_{\pi''}$$

Generally speaking, to prove that this holds, we only need to prove that adding a policy $\pi$ to the set of policies $\pi''$ instead of $\pi'$ reduces the reward and increases the penalty. Two situations may occur when a new policy $\pi$ is added to $\pi''$.

In the second situation, there are some path cross points between $\pi$ and $\pi'' - \pi'$. There are two cases for these cross points, $t_{c1} \le t_{c2}$ and $t_{c1} > t_{c2}$, where $t_{c1}$ is the time at which the cross point is visited by $\pi'' - \pi'$ and $t_{c2}$ is the time at which it is visited by $\pi$.
Thus, formula 27 is satisfied and corollary 1 is proved.
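The guarantee in Theory 1 can be checked numerically on a toy problem. The sketch below uses a set-coverage function, a standard example of a non-decreasing submodular function, rather than the paper's accumulated functions. The data and the function names are hypothetical.

```python
from itertools import combinations

def coverage(chosen, sets):
    """f(S): number of distinct items covered by the chosen sets (submodular, non-decreasing)."""
    return len(set().union(*(sets[i] for i in chosen)))

def greedy(sets, k):
    """Pick k sets, each time taking the one with the highest incremental coverage."""
    chosen = []
    for _ in range(k):
        rest = [i for i in range(len(sets)) if i not in chosen]
        best = max(rest, key=lambda i: coverage(chosen + [i], sets) - coverage(chosen, sets))
        chosen.append(best)
    return chosen

def optimum(sets, k):
    """Brute-force the best subset of size k (only feasible for tiny instances)."""
    return max(combinations(range(len(sets)), k), key=lambda c: coverage(c, sets))

def bound(k):
    """Bound(k) = 1 - ((k - 1) / k) ** k from Theory 1."""
    return 1 - ((k - 1) / k) ** k

# Hypothetical instance on which greedy is not optimal.
sets = [{1, 2, 3, 4}, {1, 2, 5}, {3, 4, 6}]
k = 2
g = coverage(greedy(sets, k), sets)       # greedy takes the largest set first
opt = coverage(optimum(sets, k), sets)    # the two smaller sets cover more together
print(g, opt, g / opt >= bound(k))
```

Greedy covers 5 items while the optimum covers 6, a ratio of about 0.83, which stays above Bound(2) = 0.75.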

Corollary 2 The reward lower bound of the centralized control model with $k$ layers is $\left(1 - \frac{1}{e}\right)^{k}$ of the optimal reward.
Proof Without loss of generality, we take the double-layer control structure as an example. The information flow is bottom-up: information is summarized for the swarm leader, and the swarm leader evaluates the whole reward. In corollary 1, we proved the performance bounds of the different accumulated functions independently. Here we consider them as a whole.
In the upper-layer control structure, each region block corresponds to an upper-layer vertex and each sub-swarm is regarded as a mobile entity. When the swarm leader makes a decision, it assumes that each sub-swarm can gather the optimal reward of its region block. Nevertheless, the sequential allocation method is used by all the sub-swarm leaders. The approximate lower bound is:

$$\mathrm{Bound}(W) = 1 - \left(\frac{W-1}{W}\right)^{W}$$

where $W$ is the number of I-UAVs in the sub-swarm. As $W \to \infty$, $\mathrm{Bound}(W) \to 1 - \frac{1}{e}$. This means that the sub-swarm leader can gather at least $1 - \frac{1}{e}$ of the joint-policy reward in the region block. Similarly, the sequential allocation method is also used in the decision process of the swarm leader. The approximate lower bound is:

$$\mathrm{Bound}(U) = 1 - \left(\frac{U-1}{U}\right)^{U}$$

where $U$ is the number of sub-swarm leaders. As $U \to \infty$, $\mathrm{Bound}(U) \to 1 - \frac{1}{e}$. Since the two bounds apply at successive layers, they multiply, which gives the bound for the double-layer structure. Thus, corollary 2 is proved.
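As a worked illustration, the bounds can be evaluated for the sub-swarm sizes used later in the case experiment ($W = 3$ I-agents per sub-swarm, $U = 10$ sub-swarms). This assumes that $\mathrm{Bound}(\cdot)$ takes the form given in Theory 1 and that the per-layer bounds multiply, which is how the proof above is read here, not a result stated explicitly in the paper.

```latex
\begin{align*}
\mathrm{Bound}(W = 3)  &= 1 - \left(\tfrac{2}{3}\right)^{3} = \tfrac{19}{27} \approx 0.704 \\
\mathrm{Bound}(U = 10) &= 1 - \left(\tfrac{9}{10}\right)^{10} \approx 0.651 \\
\mathrm{Bound}(W)\cdot\mathrm{Bound}(U) &\approx 0.704 \times 0.651 \approx 0.458 \\
\left(1 - \tfrac{1}{e}\right)^{2} &\approx 0.632^{2} \approx 0.400
\end{align*}
```

For finite $W$ and $U$, the product is therefore somewhat better than the limiting value $\left(1 - \frac{1}{e}\right)^{2}$, which acts as the worst case.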

Corollary 3 As the number of UAVs increases, the computation complexity for the swarm leader will not change.
Proof Without loss of generality, we take a UAV swarm with $l$ layers as an example. Let each decision-making node manage $N$ sub-nodes. The horizon of each decision-making node is $D$ and each action has $K$ choices, so there are $N + N^{2} + \ldots + N^{l}$ nodes (excluding the swarm leader) in the swarm. When making decisions for one sub-node, the number of possible action states is $K^{D}$. If the swarm leader made decisions for all the nodes in the swarm by the sequential computing method, the number of action states would be:

$$\left(N + N^{2} + \ldots + N^{l}\right) \cdot K^{D}$$

In this paper, we distribute the decision-making process of the swarm leader among all the decision-making nodes. Each node only handles the behaviors of its direct sub-nodes, so the number of states for a decision-making node is $N \cdot K^{D}$, which does not depend on the number of layers. In other words, our algorithm greatly reduces the computation complexity for the swarm leader. Thus, corollary 3 is proved.
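The counting argument in the proof can be made concrete with a short Python sketch. The function names and the example values of $N$, $K$ and $D$ are illustrative assumptions. The centralized count assumes that the swarm leader evaluates $K^{D}$ action states for every node below it.

```python
def nodes_below_leader(n, layers):
    """N + N^2 + ... + N^l: all nodes in the swarm except the swarm leader."""
    return sum(n ** i for i in range(1, layers + 1))

def centralized_states(n, layers, k, d):
    """Action states if the swarm leader plans sequentially for every node itself."""
    return nodes_below_leader(n, layers) * k ** d

def per_node_states(n, k, d):
    """Action states when each decision-making node plans only for its N direct sub-nodes."""
    return n * k ** d

# Illustrative values: N = 3 sub-nodes per node, K = 5 choices per action, horizon D = 2.
for layers in (1, 2, 3, 4):
    print(layers, centralized_states(3, layers, 5, 2), per_node_states(3, 5, 2))
```

The centralized count grows with every added layer, while the per-node count stays at $N \cdot K^{D}$, which is the point of corollary 3.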

Empirical evaluation
In this section, we evaluate the performance of our algorithm on an abstract multi-agent information gathering problem. First, a case experiment is conducted with empirically chosen parameters. Second, we perform a parameter sensitivity analysis based on the case experiment. In the experiments, we focus on the macro planning process rather than on how to control each UAV.

Case experiment
We consider a disaster response scenario where an earthquake happened in a suburban area [40], where rescuers need urgent continuous intelligence information. This section includes problem statement, calculation expectation, experiment setup, and experiment result.
6.1.1 Problem statement. Earthquakes have catastrophic effects on people. After an earthquake, the ground infrastructure in the disaster area may be destroyed. A UAV swarm is one of the most effective ways to acquire the latest information quickly. In this scenario, a UAV swarm with a large number of UAVs is deployed to gather the newest information about the unknown environment. We assume that the UAV swarm has good communication quality, and unforeseen circumstances such as communication interruption and mechanical breakdown are not taken into consideration. Note that we focus on the patrolling problem from a high-level perspective. The environment is modeled as a layout graph, with information attached to each vertex. Each vertex in the layout graph corresponds to an area in the real world.
To manage the UAV swarm effectively, the swarm requires a command and control structure. In this scenario, we focus on the double-layer centralized structure. There is one swarm leader in the UAV swarm, which makes decisions for several sub-swarm leaders. Each sub-swarm leader controls its information gathering UAVs. The overall process is as follows. First, the information gathering UAVs collect environment information; each sub-swarm leader then calculates the information belief of its layer and transfers it to the swarm leader. Second, the swarm leader calculates the information belief of the whole environment and makes decisions for the sub-swarm leaders; after that, the sub-swarm leaders make decisions for their subordinate information gathering UAVs.
6.1.2 Calculating expectation. Some performance indicators, such as information value and time are evaluated through experiments. In this experiment, we mainly take the total information value and the swarm leader decision time into consideration. On one hand, the goal of our model is to collect information as much as possible. The total information value gathered by I-agents reflects the overall situation of the algorithm. On the other hand, the decision time is an important performance indicator to evaluate the computation complexity of the algorithm. Meanwhile, we compare our algorithm with other algorithms. Theoretically, our algorithm not only gathers much information, but also has less computing time for each decision maker.
There are three algorithms in Section 4. Intuitively, the team of agents patrolling algorithm (TAPA for short) consists of many single agent patrolling algorithms (SAPA for short), while the UAV swarm patrolling algorithm (USPA for short) is made up of several team of agents patrolling algorithms. Thus, we benchmark USPA against a baseline algorithm and a random algorithm. Specifically, these algorithms are as follows:
• USPA represents the UAV swarm patrolling algorithm. The UAV swarm has a double-layer centralized command and control structure in USPA.
• POMCP represents Partially Observable Monte Carlo Planning [34]. It is a promising approach for online planning, and it can efficiently search over long planning horizons. The UAV swarm has single-layer centralized command and control structure in POMCP.
• RA represents the random algorithm. Each agent either moves to a random adjacent position or remains at its current position. The UAV swarm has a single-layer centralized command and control structure in RA.

Experiment setup.
Parameters are set based on experience. We first introduce the parameters of the lower-layer environment of USPA, which are also the environment parameters of POMCP and RA, because POMCP and RA use a single-layer environment. Then we describe the specific parameters of the upper-layer environment of USPA.
The lower-layer environment is modeled as the lower-layer layout graph $G_l$. Let the target area be 40 million square meters, with each lower-layer vertex $v_l$ corresponding to an area of 100 thousand square meters. The lower-layer layout graph is then a grid of 400 vertices with a side length of 20 vertices. Each vertex $v_l$ carries information, described by an information level and an information value. In the disaster response scenario, the newest information, such as the degree of damage to buildings, roads and people, needs to be collected and merged into a situation map of the disaster. Due to weather and aftershocks, the environment may change dynamically and uncertainly, so the disaster situation information of the target area changes with time. Here we focus on the degree of change of the information. Intuitively, the larger the degree of change, the more new information the area may contain. The information level is modeled as five levels: $I_1$ = no new information, $I_2$ = little new information, $I_3$ = some new information, $I_4$ = lots of new information, $I_5$ = completely new information. The corresponding information value vector is set as $f(I) = [0, 1, 2, 3, 4]$. The initial information levels of all vertices are set as $I_1$.

The UAV is abstracted as a patrolling agent moving on the layout graph, and 30 I-agents are allocated on the layout graph. We assume that it takes 5 minutes to complete an OODA (Observation, Orientation, Decision, Action) process for all the agents in this layer, so one time step $t_l$ corresponds to 5 minutes in the real world. The horizon $D_l$ for the leader is 1. Additionally, the reward acquired by an agent is equal to the information value of the vertex at that moment. Let the discount factor be $\gamma = 0.9$. Information beliefs are necessary for predicting the information in other vertices. Let the initial information beliefs of all vertices be $\Lambda = [1, 0, 0, 0, 0]$, with all vertices following the same information value transition matrix $P_l$.
The matrix $P_l$ is as follows:

The upper-layer environment is modeled as the layout graph $G_h$. In our model, the upper-layer environment and the lower-layer environment correspond to the same target area, but their time, actions, layout graph and information belief differ. The correspondences of time, layout graph, actions and information belief are described in definitions 6, 7, 11 and 14 respectively. In the upper-layer environment, the swarm leader is the decision maker and the sub-swarm leaders are actuators, while in the lower-layer environment the sub-swarm leaders are decision makers and the I-agents follow the commands of the sub-swarm leaders. Let each upper-layer vertex $v_h$ correspond to 1.6 million square meters. The upper-layer layout graph has 25 upper-layer vertices, with a side length of 5 vertices, and each upper-layer vertex corresponds to 16 lower-layer vertices. In addition, the information of an upper-layer vertex differs from that of a lower-layer vertex: an upper-layer vertex has only an information belief rather than a specific information level, because it contains many lower-layer vertices with different information levels. Each sub-swarm has 3 I-agents, so the 30 I-agents are divided into 10 sub-swarms. In the upper-layer environment, one time step $t_h$ corresponds to 20 minutes, and the time step ratio $M$ is 4. Let the horizon $D_h$ be 1. The upper-layer information value transition matrix is $P_h = P_l^{M}$.

We run 20 rounds for each algorithm, with 400 lower-layer time steps $t_l$ in each round. The performance of each algorithm is then evaluated with the performance indicators above. The algorithms run on a machine with a 2.5 GHz Intel dual-core CPU and 8 GB of RAM.
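The relation $P_h = P_l^{M}$ can be illustrated with a small pure-Python sketch. The entries of `P_l` below are hypothetical placeholders chosen only so that each row sums to 1; they are not the matrix used in the experiments. Only the initial belief $[1, 0, 0, 0, 0]$, the value vector $[0, 1, 2, 3, 4]$ and $M = 4$ are taken from the setup.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def mat_pow(p, m):
    """p raised to the positive integer power m by repeated multiplication."""
    result = p
    for _ in range(m - 1):
        result = mat_mul(result, p)
    return result

def propagate(belief, p):
    """One Markov step: row-vector belief times transition matrix."""
    n = len(belief)
    return [sum(belief[i] * p[i][j] for i in range(n)) for j in range(n)]

def expected_value(belief, values):
    """Expected information value of a vertex under the given belief."""
    return sum(b * v for b, v in zip(belief, values))

# HYPOTHETICAL transition matrix: each level persists with probability 0.8
# and rises by one level with probability 0.2; the top level is absorbing.
P_l = [
    [0.8, 0.2, 0.0, 0.0, 0.0],
    [0.0, 0.8, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.8, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.8, 0.2],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
M = 4                                   # time step ratio from the setup
P_h = mat_pow(P_l, M)                   # upper-layer matrix P_h = P_l^M
belief = [1.0, 0.0, 0.0, 0.0, 0.0]      # initial belief: no new information
values = [0, 1, 2, 3, 4]                # information value vector f(I)

step_by_step = belief
for _ in range(M):                      # four lower-layer steps
    step_by_step = propagate(step_by_step, P_l)
one_upper_step = propagate(belief, P_h) # one upper-layer step

print(expected_value(one_upper_step, values))
```

Propagating the belief through four lower-layer steps and through one upper-layer step gives the same distribution, which is why the swarm leader can reason at the coarser time scale with $P_h$ alone.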
6.1.4 Experiment result. Fig 3 shows the total information value. The y axis represents the total information value gathered by the 30 I-agents. According to this figure, the total information value of USPA is 36.77% larger than that of RA. However, the computer memory is not sufficient to run POMCP. Indeed, each I-agent has about 5 neighbours at each vertex, and each vertex has 5 information levels, so the joint action space and the joint observation space are close to $5^{30}$. It is therefore hard to evaluate the performance of POMCP in this scenario.

Table 2 shows the decision time for the swarm leader. The second row gives the average time that the swarm leader takes to make a decision for its direct subordinates. The third row gives the mean square error (MSE for short) over 20 rounds. The unit of the average time and the MSE is seconds. The symbol "-" indicates that the memory space is exceeded. The table shows that the run time of RA is much lower than that of USPA. However, the difference between the two is not significant from the macro perspective, because a lower-layer time step is set to 5 minutes in this scenario.

In general, for the information gathering problem in an earthquake area, USPA can in theory be applied to a UAV swarm with a large number of UAVs, because USPA meets the expectations in this scenario: the decision time for the swarm leader is quite short and the total reward is high enough.
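The size gap behind the memory failure of POMCP can be made concrete with a rough count. The sketch below assumes, as in the text, about 5 choices per agent. The sequential count is a simplification that treats each agent as evaluating its own choices once, and the function names are illustrative.

```python
def joint_space(choices_per_agent, agents):
    """Size of the joint space when all agents are planned together."""
    return choices_per_agent ** agents

def sequential_evaluations(choices_per_agent, agents, horizon=1):
    """Rough number of policies evaluated when agents are planned one at a time."""
    return agents * choices_per_agent ** horizon

print(joint_space(5, 30))             # on the order of 10**20 joint actions
print(sequential_evaluations(5, 30))  # a few hundred single-agent evaluations
```

The joint space grows exponentially with the number of agents, whereas the sequential count grows linearly, which is consistent with USPA remaining tractable in this scenario.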

Parameter sensitivity analysis experiment
The parameter sensitivity analysis experiment builds on the case experiment. In this section, we evaluate parameters that may influence the performance indicators. Some parameters are adjusted to evaluate whether USPA meets the expectations. Specifically, these parameters are the number of sub-swarms (NoS for short), the number of layers (NoL for short), and the horizon. The practical value is then summarized based on the experiment results.

Evaluation of the number of sub-swarms.
In this scenario, the number of I-agents in a sub-swarm is fixed at 3, so the total number of I-agents changes with the number of sub-swarms. Other parameters are the same as in the case experiment. We construct 6 scenarios and compare USPA with POMCP and RA in each.

Fig 4 shows the total information value acquired by the I-agents. The figure has 6 panels, one for each scenario. The y axis represents the total information value acquired by all the I-agents. From these panels, we find that the reward increases monotonically as the number of sub-swarms increases, because the larger total number of I-agents can gather more information. In Scenario A, the reward of POMCP is 10.63% better than that of USPA, while the reward of USPA is 67.18% better than that of RA. However, as the number of sub-swarms increases, the joint action space and the joint observation space grow exponentially and exceed the memory of the machine, so it is hard to run POMCP in the larger scenarios. Overall, the reward of USPA is slightly lower than that of POMCP and better than that of RA.

Table 3 shows the average decision time of the swarm leader and its mean square error. The unit of time is seconds. The symbol "-" indicates that the memory space is exceeded. The table shows that the decision time of the swarm leader increases as the number of sub-swarms increases. From a macro perspective, there is little difference between the run time of RA and that of USPA.

Evaluation of the number of layers.
From section 6.2.1 we know that the reward increases as the number of I-agents increases. In this section, we evaluate the influence of the number of layers when the number of I-agents, the simulation time and the physical layout graph for the I-agents are fixed. Here, let the number of I-agents be 81, the simulation time for the I-agents be 270, the lower-layer layout graph be 81 × 81 vertices, the time step ratio be $M = 3$, and the region block be a square area of 3 × 3 vertices. Some other parameters change with the number of layers. Each agent has about 5 neighbours and each vertex has 5 information levels, so the joint action space and the joint observation space are about $5^{81}$. Other parameters are the same as in the case experiment. For $L$ layers, let $l = L$ be the highest layer and $l = 1$ be the lowest layer. There are 4 scenarios in the experiment.
• Scenario B: The number of layers is 2. The swarm leader controls 27 sub-swarms, while each sub-swarm leader controls 3 I-agents. Because the time step ratio is $M = 3$ and the simulation time for the lowest layer is 270, the simulation time of the highest layer is 90. Each region block contains 3 × 3 vertices, so the layout graph of the highest layer is 27 × 27 vertices.
• Scenario C: The number of layers is 3. The swarm leader controls 9 sub-swarms, while each sub-swarm leader controls 3 subordinate agents. Because the time step ratio is $M = 3$ and the simulation time for the lowest layer is 270, the simulation time of the highest layer is 30. Each region block contains 3 × 3 vertices, so the layout graph of the highest layer is 9 × 9 vertices.
• Scenario D: The number of layers is 4. The swarm leader controls 3 sub-swarms, while each sub-swarm leader controls 3 subordinate agents. Because the time step ratio is $M = 3$ and the simulation time for the lowest layer is 270, the simulation time of the highest layer is 10. Each region block contains 3 × 3 vertices, so the layout graph of the highest layer is 3 × 3 vertices.

From the results, we know that the reward of USPA is at least 34.37% better than that of RA. In addition, the reward of USPA decreases as the number of layers increases. Indeed, the decision-making process of the swarm leader is subject to hysteresis. Based on Eq 4, one upper-layer time step $t_h$ lasts as long as $M$ lower-layer time steps $t_l$. In Scenario D, one time step of the 4th layer corresponds to 27 time steps of the 1st layer. That means the environment changes 27 times while the swarm leader makes one decision. Thus, as the number of layers increases, the hysteresis becomes greater and the reward decreases.

Table 4 shows the average decision time for the swarm leader and the mean square error over 20 rounds. The decision time of RA is clearly less than that of USPA. Meanwhile, the decision time decreases as the number of layers increases, because the number of sub-swarm leaders directly subordinate to the swarm leader decreases. Therefore, the decision time is reduced at the cost of a lower reward.
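The scenario parameters and the growth of the hysteresis follow from the time step ratio and the region block size, as the short sketch below shows. The function name and dictionary keys are illustrative. The defaults are the values stated for this experiment.

```python
def top_layer_setup(layers, base_side=81, base_steps=270, ratio=3, block_side=3):
    """Layout side, simulation steps and decision lag of the highest layer.

    base_side   side length of the lowest-layer layout graph (vertices)
    base_steps  simulation time of the lowest layer (time steps)
    ratio       time step ratio M between adjacent layers
    block_side  side length of a region block (vertices)
    """
    space_scale = block_side ** (layers - 1)   # each extra layer merges 3 x 3 blocks
    time_scale = ratio ** (layers - 1)         # each extra layer is M times slower
    return {
        "top_side": base_side // space_scale,
        "top_steps": base_steps // time_scale,
        "lag": time_scale,                     # lowest-layer steps per top-layer decision
    }

for layers in (2, 3, 4):                       # Scenarios B, C and D
    print(layers, top_layer_setup(layers))
```

With four layers, the lag reaches $M^{3} = 27$ lowest-layer time steps per decision of the swarm leader, matching the value quoted for Scenario D.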

Evaluation of horizon.
In this section, we evaluate the influence of the horizon. To allow a comparison with POMCP, we decrease the number of I-agents and the size of the layout graph. We consider 4 I-agents, divided into 2 sub-swarms of 2 I-agents each. The joint action space and the joint observation space are about $5^{4}$. The lower-layer layout graph contains 9 × 9 vertices, while the upper-layer layout graph contains 3 × 3 vertices. The region block contains 3 × 3 vertices. Additionally, there are 2 types of horizon in USPA: the upper-layer horizon $D_h$ for the swarm leader and the lower-layer horizon $D_l$ for the sub-swarm leader. Here, we set $D_h = D_l$. There is no horizon for RA. Other parameters are the same as in the case experiment. There are 4 scenarios in the experiment.
• Scenario A: The horizon is 1.
• Scenario B: The horizon is 2.
• Scenario C: The horizon is 3.
• Scenario D: The horizon is 4.

The y axis represents the total information value gathered by all the I-agents. According to the figure, the rewards of POMCP and USPA are much better than the reward of RA. Meanwhile, the ratio of the reward of POMCP to the reward of USPA varies with the horizon: the ratios are 1.06, 1.03, 1.09 and 1.15 for Scenario A, Scenario B, Scenario C and Scenario D respectively. In principle, the larger the horizon, the further ahead an agent can predict. However, USPA uses the sequential allocation method, in which the agents assigned first gather more information and the agents assigned later avoid the previous paths. Therefore, for USPA, the reward increases with the horizon at the beginning, but decreases once the horizon exceeds a certain threshold.

Table 5 shows the average decision time for the swarm leader and the mean square error over 20 rounds. The unit of time is seconds. For both POMCP and USPA, the decision time increases with the horizon. The decision time of POMCP is much larger than that of USPA, while the decision time of USPA is much larger than that of RA.

Experiment summary.
In this section, we conducted experiments on three aspects: the number of sub-swarms, the number of layers, and the horizon. In addition, we compared USPA with POMCP and RA. The results show that USPA meets our expectation: the I-UAVs gather a sufficiently large amount of information, and decision makers need only a very short computing time. Moreover, the results have practical implications. First, more sub-swarms yield more reward, so as many UAVs as conditions permit should be deployed. Second, for the same I-UAVs, target area and time, the number of layers of the UAV swarm should not be too large, because decision time is reduced at the cost of reward. A flat command and control structure is therefore the better option when time is sufficient. Third, when the sequential allocation method is used, the horizon of the decision maker should not be too long. It is better to find the most suitable value by weighing reward against decision time.

Conclusion and future work
In this paper, we develop a patrolling task planning algorithm for the UAV swarm with a double-layer centralized control structure in an uncertain and dynamic environment. Unlike previous work, we take the complex relationships within the swarm into consideration. Based on the model of the double-layer environment, we give models of three types of UAVs. Given these models, the UAV swarm patrolling problem is modeled as a POMDP. To reduce the state space, the compact information belief vector is proposed according to the independent evolution property of each vertex. After that, an information heuristic function is put forward to increase the reward, based on the property of multi-state Markov chains. Although the swarm leader could obtain the information from the sub-swarm leaders, it is critical to build the compact information belief and the information heuristic function, which increase the autonomous decision ability of the swarm leader and reduce the interaction frequency. On this basis, we construct the single agent patrolling algorithm, the team of agents patrolling algorithm and the UAV swarm patrolling algorithm. Our algorithm is scalable and guarantees performance. It reduces the computation complexity for the swarm leader as the number of layers increases. Finally, we conduct simulation experiments to evaluate the performance of our algorithm.
There are several contributions in this paper. Generally, our algorithm can be applied in a wide range of domains that exhibit the general properties of submodularity, temporality, locality and multiple layers. The integration of a computable structure and a myopic algorithm can be applied to more scenarios of the UAV swarm. Specifically, this algorithm improves the patrolling efficiency while guaranteeing performance. In addition, our algorithm is scalable and is easy to extend to more control layers. Moreover, our algorithm can alleviate the computing pressure on the centralized control node by allocating the computing work to other sub-decision nodes. Therefore, our algorithm provides an effective way to solve the patrolling problem of a large-scale multi-UAV system. However, there is a conflict between the number of layers and the number of sub-nodes subordinate to a decision node. On the one hand, the computing capability of a decision node is finite, so it cannot control an unlimited number of UAVs. On the other hand, as the number of layers increases, the performance of our algorithm decreases exponentially. So there is a trade-off between the number of layers and the number of UAVs.
The main challenge in extending our work is to take swarm intelligence into consideration. In this paper, we have considered the double-layer centralized control structure, which is just one of the possible control structures. The UAV swarm differs from general multi-agent systems in having swarm intelligence and swarm behavior. The complexity of the UAV swarm derives from the combination of bottom-up autonomy and top-down command. Swarm intelligence reduces the burden on the UAV operator and improves the search efficiency, but addressing it requires radically different techniques. Intuitively, swarm intelligence is reflected in adaptivity: the UAV swarm can adjust autonomously to different environments and missions, and different control structures suit different environments and missions. Thus, a feasible way forward is to construct a mixed control structure. Specifically, the centralized decision problem can be modeled as a POMDP, while the decentralized decision problem can be modeled as a Dec-POMDP or a distributed constraint optimization problem (DCOP).