A novel and effective method for solving the router nodes placement in wireless mesh networks using reinforcement learning

Router nodes placement (RNP) is an important issue in the design and implementation of wireless mesh networks (WMNs). It is known to be an NP-hard problem, which cannot be solved efficiently using conventional exact algorithms. Consequently, approximate optimization strategies are commonly used to solve it. In WMNs with heavy node density and wide coverage areas, however, solving the RNP problem using approximation algorithms often faces many difficulties, so a more effective solution is necessary. This motivated us to conduct this work. We propose a new method for solving the RNP problem using reinforcement learning (RL). The RNP problem is modeled as an RL model in which the environment, agent, action, and reward correspond to the network system, routers, coordinate adjustment, and connectivity of the RNP problem, respectively. To the best of our knowledge, this is the first study that applies RL to solve the RNP problem. The experimental results showed that the proposed method increased the network connectivity by up to 22.73% compared to the most recent methods.


Introduction
Wireless communication is growing and being widely applied in many fields. In the local area networks of agencies, businesses, schools, and so on, wireless mesh networks (WMNs) [1,2] are the best choice today because of their significant advantages compared to wireless networks using traditional access points. The most notable benefit of a WMN is that it reduces congestion owing to its ability to balance loads. In addition, installing a WMN is very convenient because there is no need to construct wired connections from the gateway to all routers. Fig 1 illustrates an example of a WMN consisting of six mesh routers (represented by $r_1$ to $r_6$) and eleven mesh clients (represented by $c_1$ to $c_{11}$). In addition, at least one router of the Internet service provider serves as a gateway for clients to access the Internet. If two mesh routers are within range of each other, a wireless link is established between them. A mesh topology consists of all the mesh routers and wireless links. For a WMN to deliver Internet services, several mesh routers must be connected to the gateway router via wireless or cable links. As shown in Fig 1, the mesh routers $r_1$ and $r_2$ are connected to the gateway router (GPON or FTTH router) via wireless links. Mesh clients are the terminal devices that use network services. When a mesh client enters the network region, it can be covered by one or more mesh routers; the mesh client connects to the nearest mesh router to access network services.
With the rapid development of wireless and mobile communication technologies, network services are becoming more diverse and rich, especially those on fifth-generation (5G) and sixth-generation (6G) wireless network platforms. To provide these services effectively, WMNs must be designed and installed as efficiently as possible so that network resources are fully utilized. This has motivated researchers to focus on WMNs. The most prevalent subjects studied include network topology control [3][4][5][6][7], router node placement (RNP) [8][9][10][11][12][13][14][15][16][17][18][19][20][21][22][23][24], optimum routing protocols [25][26][27][28][29], and access point allocation [30][31][32][33], with the RNP problem attracting the most interest. Because the RNP problem is known to be NP-hard, it cannot be solved efficiently using conventional algorithms. Recently, approximate optimization methods have become useful for solving this problem [8][9][10][11][12]. The authors of [8] used the coyote optimization algorithm (COA) to solve the RNP problem. Their proposed method optimizes both network connectivity and user coverage, which are two critical performance criteria. Using MATLAB simulations, the authors demonstrated that the COA outperformed other well-known optimization algorithms. In [10], the authors proposed an optimization method based on the Chemical Reaction Optimization (CRO) algorithm to solve this problem. The CRO algorithm was inspired by how molecules interact to achieve a low, stable energy state in chemical reactions. In terms of client coverage and network connectivity, the simulation findings reveal that their approach outperforms the Genetic Algorithm (GA) and Simulated Annealing (SA). Another study employed a genetic algorithm and simulated annealing to discover a low-cost WMN configuration while satisfying restrictions and identifying the number of gateways needed [34]. Experiments showed that the evolutionary algorithm and simulated annealing were successful in lowering WMN network expenses while maintaining QoS. The new models significantly outperformed the conventional solutions. QoS was also considered in the RNP problem in [23]. The authors described a particle swarm optimization method for improving network connectivity and client coverage. The QoS constraints in that study are the delay, relay load, and Internet gateway capacity. In [35], the authors proposed an improved version of the Moth Flame Optimization (MFO) algorithm, namely Enhanced Chaotic Lévy Opposition-based MFO (ECLO-MFO), for solving the RNP problem. To improve the optimization performance of MFO, the proposed method integrates three strategies: the chaotic map concept, the Lévy flight strategy, and the Opposition-Based Learning (OBL) technique. The simulation results showed that the proposed algorithm was more efficient than methods applying popular optimization algorithms.
Based on the results of published works, we find that methods using approximate optimization algorithms provide good solutions. However, because randomness is used in several steps of these algorithms, the results often differ between executions. For reliable results, each scenario must be executed multiple times and the results averaged over all executions. For example, the authors of [8,11] executed each simulation scenario 50 times. Furthermore, in WMNs with heavy node density and wide coverage areas, solving the RNP problem with approximation algorithms often presents many difficulties, necessitating a more effective solution. In this paper, we propose a new and effective algorithm to solve this problem. The main contributions of this study are summarized as follows:
(i) We propose a novel and effective method for solving the RNP problem using RL. The RNP problem is modeled as an RL model, with the environment, agent, action, and reward representing the network system, routers, coordinate adjustment, and connectivity of the RNP problem, respectively. To the best of our knowledge, this is the first study to apply reinforcement learning to the RNP problem.
(ii) We compare and evaluate the performance of RNP problem solving methods based on heuristic algorithms and on RL.
The remainder of this paper is organized as follows. The next section describes the formulation of the RNP problem in the WMN. The following sections present our proposed solution and experimental results. Finally, concluding remarks and promising future studies are presented in the last section.

RNP problem
In this section, we formulate the RNP problem in a WMN. First, we use graph theory to describe the WMN. We then define some metrics used in the objective function of the RNP problem, similar to [11]. Finally, the RNP problem is formulated as a nonlinear programming problem. For convenience, the mathematical symbols are defined in Table 1.

Mathematical model of a WMN using graph theory
Consider a WMN comprising m mesh routers, n mesh clients, and k gateway routers. Mathematically, this WMN can be represented as an undirected graph, denoted by $G = (V, E)$, where V and E are the vertex and edge sets, respectively. V is equivalent to the set of all nodes in the WMN and is determined by
$$V = R \cup C \cup W$$
where R, C and W are the sets of mesh routers, mesh clients, and gateway routers, respectively. E is equivalent to the set of all wireless links in the WMN and consists of three types: links between mesh routers, links between a mesh client and a mesh router, and links between a gateway and a mesh router.

RNP problem formulation
In this section, we formulate the RNP problem using some concepts and metrics from [11], including the connected router, connected client, connected router ratio, and connected client ratio.
Connected router. A mesh router $r_i$ is a connected router if and only if at least one path exists between it and the gateway router. Returning to the WMN example in Fig 1, we can see that mesh routers $r_1$, $r_2$, $r_3$, $r_4$ and $r_6$ are connected routers, but $r_5$ is not, because no path exists from this mesh router to the gateway router.
Connected router ratio (CRR). The CRR is defined as the percentage of connected routers in relation to the total number of routers in a WMN, calculated by [11]
$$\mathrm{CRR} = \frac{\sum_{i=1}^{m} \alpha(r_i)}{m} \times 100 \quad (1)$$
where m is the number of routers in a WMN and $\alpha(r_i)$ is a function that indicates whether router $r_i$ is a connected router or not, defined by
$$\alpha(r_i) = \begin{cases} 1, & \text{if } r_i \text{ is a connected router} \\ 0, & \text{otherwise} \end{cases} \quad (2)$$
Connected client. A mesh client $c_i$ is a connected client if and only if it is covered by at least one connected router. Let $\beta(c_i)$ be a function that indicates whether client $c_i$ is a connected client, returning 1 if yes and 0 otherwise. Then, $\beta(c_i)$ is calculated as
$$\beta(c_i) = \begin{cases} 1, & \text{if } \exists\, r_j \in R \text{ such that } \alpha(r_j) = 1 \text{ and } d(c_i, r_j) \le d_r \\ 0, & \text{otherwise} \end{cases} \quad (3)$$
where $d_r$ is the coverage radius of the routers and $d(c_i, r_j)$ is the distance between client $c_i$ and router $r_j$, given by
$$d(c_i, r_j) = \sqrt{(x_{c_i} - x_{r_j})^2 + (y_{c_i} - y_{r_j})^2} \quad (4)$$
where $(x_{c_i}, y_{c_i})$ and $(x_{r_j}, y_{r_j})$ are the coordinates of the client $c_i$ and router $r_j$, respectively.

Connected client ratio (CCR). The CCR is defined as the percentage of connected clients in relation to the total number of clients in a WMN, calculated by [11]
$$\mathrm{CCR} = \frac{\sum_{i=1}^{n} \beta(c_i)}{n} \times 100 \quad (5)$$
where n is the number of mesh clients and $\beta(c_i)$ is determined according to (3).
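The two ratios can be computed directly from node coordinates. The following Python sketch is only an illustration, not the authors' implementation: the function names are hypothetical, and it assumes that two mesh routers are linked when their distance is at most $d_r$ and that a router is linked to a gateway when their distance is at most $d_w$, a link rule that the text does not state explicitly.

```python
from collections import deque
from math import hypot


def connected_routers(routers, gateways, d_r, d_w):
    """Return the indices of routers that have a path to some gateway.

    Assumed link rule (hypothetical): router-router links exist within
    distance d_r, router-gateway links within distance d_w.
    """
    def dist(p, q):
        return hypot(p[0] - q[0], p[1] - q[1])

    # Routers directly linked to a gateway seed the breadth-first search.
    seen = {i for i, r in enumerate(routers)
            if any(dist(r, g) <= d_w for g in gateways)}
    queue = deque(seen)
    while queue:
        i = queue.popleft()
        for j, r in enumerate(routers):
            if j not in seen and dist(routers[i], r) <= d_r:
                seen.add(j)
                queue.append(j)
    return seen


def crr_ccr(routers, clients, gateways, d_r, d_w):
    """Return (CRR, CCR) as percentages; inputs must be non-empty."""
    connected = connected_routers(routers, gateways, d_r, d_w)
    # A client counts only if a *connected* router covers it.
    covered = sum(
        1 for c in clients
        if any(hypot(c[0] - routers[j][0], c[1] - routers[j][1]) <= d_r
               for j in connected)
    )
    return (100.0 * len(connected) / len(routers),
            100.0 * covered / len(clients))


# Toy instance: the third router is isolated, so the client next to it
# is covered but not connected.
routers = [(0, 0), (100, 0), (500, 500)]
gateways = [(0, 50)]
clients = [(10, 10), (120, 0), (500, 510), (900, 900)]
crr, ccr = crr_ccr(routers, clients, gateways, d_r=150, d_w=150)
```

The isolated router in the toy instance plays the same role as $r_5$ in Fig 1: clients in its range do not count toward the CCR.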
Formulating the RNP as a nonlinear programming problem. The RNP problem in the WMN is stated as follows. Consider a case where it is necessary to design and install a WMN with the following assumptions:
• The network system is located in an area of $W \times H$ meters.
• The number of clients is n, and they are located at a given set of coordinates $P_c = \{(x_{c_i}, y_{c_i}) \mid i = 1..n\}$.
• The number of gateway routers is k, they are located at a given set of coordinates $P_w = \{(x_{w_i}, y_{w_i}) \mid i = 1..k\}$, and the coverage radius of each gateway router is $d_w$.
• The number of mesh routers is m, and the coverage radius of each mesh router is $d_r$.
Find the set of coordinates $P_r = \{(x_{r_i}, y_{r_i}) \mid i = 1..m\}$ at which to place the m routers such that CRR and CCR are at their maximum. Thus, the RNP problem can be described as the following nonlinear programming problem:
$$\text{maximize } \{\mathrm{CRR}, \mathrm{CCR}\} \quad (6)$$
subject to the following constraints:
$$0 \le x_{r_i} \le W, \quad i = 1..m \quad (7)$$
$$0 \le y_{r_i} \le H, \quad i = 1..m \quad (8)$$
where W and H are the width and height of the network area, respectively. By solving the nonlinear programming problem with objective function (6) and constraints (7) and (8), we find the coordinate set $P_r = \{(x_{r_i}, y_{r_i}) \mid i = 1..m\}$ at which to place the m mesh routers in the network area $W \times H$. This nonlinear programming problem can be solved in various ways. Recently, applying approximate optimization algorithms to this problem has become popular [8][9][10][11][12]. In this work, we propose a new and effective method to solve this problem using reinforcement learning. The following sections describe this new method in detail.

Fundamentals of RL
RL is a type of machine learning in which the system learns from its past actions to choose better ones in the future. Fig 2 depicts the fundamental principles of RL, in which an agent operates as a learner and interacts with the environment to gain a reward and change the state of the environment. At time t, the agent interacts with the environment through an action $a_t$. The environment changes from state $s_t$ to state $s_{t+1}$ as a result of this action, and the agent is rewarded with $r_t$. Based on the rewards acquired in prior learning, the agent selects the action that provides the best reward in subsequent learning. The total reward for taking action $a_t$ in state $s_t$ is $Q(s_t, a_t)$, which is typically determined by the Q-learning algorithm as follows [36]:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \quad (9)$$
where $\alpha$ and $\gamma \in [0, 1]$ are the learning rate and the discount factor, respectively. RL has been successfully applied to control protocols in wireless networks, typically routing in WMNs [25,27,29], topology control in wireless sensor networks [37], and improving the performance of energy-harvesting wireless body area networks [38,39]. In this paper, we apply RL to solve the RNP problem in WMNs. Details of this new proposal are presented in the following sections.
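As an illustration, the standard tabular Q-learning update can be sketched in a few lines of Python. The function name, the dictionary-based table, and the default values of `alpha` and `gamma` are assumptions made for this example and are not taken from the paper.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions,
             alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q value.

    q maps (state, action) pairs to values; unseen pairs default to 0.
    """
    # Best value reachable from the next state under the current table.
    best_next = max(q[(next_state, a)] for a in actions)
    # Move the old estimate toward the bootstrapped target.
    q[(state, action)] += alpha * (reward + gamma * best_next
                                   - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)
first = q_update(q, "s0", "a", 1.0, "s1", ["a", "b"])
q[("s1", "b")] = 2.0
second = q_update(q, "s0", "a", 1.0, "s1", ["a", "b"])
```

With an empty table the first update moves the estimate halfway toward the immediate reward; once a successor value is known, the discounted term $\gamma \max_a Q(s_{t+1}, a)$ also contributes.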

Solving the RNP in WMN using RL
RL has recently been successfully employed to solve technical challenges in wireless communication such as routing [27,36], topology management [37], and resource allocation. In this study, we use RL to solve the RNP problem. To the best of our knowledge, this is the first study to use RL to address the RNP problem. To do this, the RNP problem must be modeled as a reinforcement learning model with five characteristic factors: agent, environment, state, action, and reward.
Agent. An agent is a mesh router that regularly adjusts its coordinates to obtain an optimal topology.
Environment. In an RL model, the environment is everything that exists around the agent, and it is where the agent acts and interacts. The environment for the RNP problem using RL is the network system, which includes the set of mesh routers, clients, gateway routers, and the network area.
State. Each state is determined by a triple $\{P_c, P_r, P_w\}$, where $P_c$, $P_r$ and $P_w$ are the sets of coordinates of the mesh clients, mesh routers, and gateway routers, respectively. The sets are listed in Table 1.
Action. An action is the way in which the agent interacts with the environment to change its state. For the RNP problem using reinforcement learning, the agents are the mesh routers, and each action consists of a mesh router adjusting its coordinates. The set of actions at a specific state $s_t$ for each mesh router $r_i$ is defined as $A_t$ = {mn1s, me1s, ms1s, mw1s, mn2s, me2s, ms2s, mw2s}. The actions are described in Table 2, where step is a given distance.
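A possible encoding of this action set is sketched below in Python. Table 2 is not reproduced here, so the reading of the action names (for example, mn1s as "move north by one step" and mn2s as "move north by two steps"), the convention that north is the positive y direction, and the clamping of moves to the network area are all assumptions of this sketch; `apply_action` is a hypothetical helper.

```python
# Assumed meaning of each action as a (dx, dy) multiple of the step length.
ACTIONS = {
    "mn1s": (0, 1), "me1s": (1, 0), "ms1s": (0, -1), "mw1s": (-1, 0),
    "mn2s": (0, 2), "me2s": (2, 0), "ms2s": (0, -2), "mw2s": (-2, 0),
}


def apply_action(pos, action, step, width, height):
    """Return the router position after an action, kept inside the area.

    Clamping to [0, width] x [0, height] is one way to respect the
    placement bounds; the paper may handle boundary moves differently.
    """
    dx, dy = ACTIONS[action]
    x = min(max(pos[0] + dx * step, 0.0), width)
    y = min(max(pos[1] + dy * step, 0.0), height)
    return (x, y)


moved = apply_action((100, 100), "mn2s", step=50, width=2000, height=2000)
clamped = apply_action((10, 10), "mw1s", step=50, width=2000, height=2000)
```

Offering both one-step and two-step moves lets a router make coarse adjustments early and finer ones later without changing the step length.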
Reward. The agent receives a reward for each action through which it interacts with the environment. The agent chooses the next action based on the reward values of past actions, with the goal of eventually achieving the best reward. For the RNP problem using reinforcement learning, we use the objective function defined in (6) as the reward for the learning process. This objective function consists of two metrics: CRR and CCR. To maximize both metrics, we define the reward function $RW(r_i, s_t, a_t)$ as in (10), where $RW(r_i, s_t, a_t)$ is the reward obtained when the mesh router $r_i$ performs the action $a_t \in A_t$ at state $s_t$; $\mathrm{CRR}_t$ and $\mathrm{CCR}_t$ are the connected router ratio and the connected client ratio at state $s_t$, calculated according to (1) and (5), respectively; and $\lambda$ is a coefficient in the range [0, 1] that controls the degree to which each metric is optimized. In this study, the Q-learning algorithm is used to update the total reward each time a mesh router performs an action. Let $Q(r_i, s_t, a_t)$ be the total reward received after the mesh router $r_i$ performs the action $a_t \in A_t$ at state $s_t$; then $Q(r_i, s_t, a_t)$ is updated according to (11), where $\alpha$ and $\gamma \in [0, 1]$ are the learning rate and the discount factor, respectively.
RL algorithm for solving the RNP problem. Algorithm 1 is the pseudo-code of the RL algorithm for solving the RNP problem in the WMN. First, m routers are placed at random coordinates in a network area of $W \times H$ [m$^2$] (step 1). In each learning iteration, a mesh router $r_i$ is randomly selected from set R to perform an action in set $A_t$. The policy for selecting an action in set $A_t$ is $\varepsilon$-greedy, as in [37]. Under this policy, the mesh router $r_i$ chooses the action $a_t$ at state $s_t$ whose value $Q(r_i, s_t, a_t)$ is maximum with a high probability of $1 - \varepsilon$. The remaining actions in set $A_t$ are chosen with equal, low probability determined by $\varepsilon$ (step 7), where $\varepsilon$ is set to 0.1, as in [37]. Let $\pi(r_i, s_t, a_{t,k})$ be the probability that the mesh router $r_i$ chooses action $a_{t,k}$ at state $s_t$. According to the $\varepsilon$-greedy policy, this probability is given in [37] as a function of $\varepsilon$ and $|A_t|$, where $|A_t|$ denotes the size of the set $A_t$, that is, the number of actions that the mesh router $r_i$ can select.
Algorithm 1 The pseudo-code of the reinforcement learning algorithm for solving the RNP problem
Input:
• Network area ($W \times H$);
• The set of mesh clients ($C = \{c_i \mid i = 1..n\}$) and the set of their coordinates ($P_c = \{(x_{c_i}, y_{c_i}) \mid i = 1..n\}$);
• The set of gateway routers ($W = \{w_i \mid i = 1..k\}$) and the set of their coordinates ($P_w = \{(x_{w_i}, y_{w_i}) \mid i = 1..k\}$);
• The set of mesh routers ($R = \{r_i \mid i = 1..m\}$) and the coverage radius of each mesh router ($d_r$);
Output: The set of the best coordinates of the m mesh routers: $P_r = \{(x_{r_i}, y_{r_i}) \mid i = 1..m\}$
Method:
1: Place m mesh routers at the coordinates $(x_{r_i}, y_{r_i})$, $i = 1..m$, where $x_{r_i}$ and $y_{r_i}$ are random values in the area $W \times H$;
2: while (learn ≤ numLearn) do
3:   Randomly choose mesh router $r_i \in R$;
4:   for (each action $a_j \in A$) do
5:     Update $Q(r_i, s_t, a_{t,j})$ using (11)
…
Computational complexity analysis. The computational complexity of Algorithm 1 depends mainly on the iteration in Step (2), the number of possible actions in Step (4), and the algorithm for updating the Q value in Step (5). $Q(r_i, s_t, a_{t,j})$ is updated using Eq. (11), where the dominant cost is the calculation of $RW(r_i, s_t, a_t)$ according to (10). $RW(r_i, s_t, a_t)$ contains two metrics, CRR and CCR, which are defined by (1) and (5), respectively. To determine CRR, we employ a breadth-first search algorithm on a network of m vertices, where m is the number of mesh routers. Therefore, its computational complexity is $O(m^2)$. The CCR is calculated using two nested loops of sizes m and n, where n is the number of mesh clients. Therefore, its complexity is $O(m \times n)$. Because n is always greater than m in a WMN, the computational complexity of $RW(r_i, s_t, a_t)$ is $O(m \times n)$. Consequently, the computational complexity of Algorithm 1 is $O(I \times |A| \times m \times n)$, where I is the number of iterations and $|A|$ is the number of possible actions.
The computational complexity of Algorithm 1 is greater than that of the algorithms solving the RNP problem using GA [40], PSO [24], and WOA [41], with which we compare it in the following section. However, because its computational complexity is polynomial, it can be implemented in practice. Furthermore, because the algorithms for solving the RNP problem are run offline, the polynomial complexity is acceptable.
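The overall learning loop can be sketched in Python as follows. This is a deliberately simplified illustration and not the authors' algorithm: the Q table is keyed by (router index, action) with the state left implicit, the exploration rule picks uniformly among the non-greedy actions with total probability `epsilon` (one possible reading of the $\varepsilon$-greedy description), moves are clamped to the area, and the demo reward is plain client coverage rather than the CRR/CCR-based reward. All names and parameter values are hypothetical.

```python
import random
from math import hypot

STEP = 50.0  # assumed step length in meters
MOVES = {"mn1s": (0, 1), "me1s": (1, 0), "ms1s": (0, -1), "mw1s": (-1, 0),
         "mn2s": (0, 2), "me2s": (2, 0), "ms2s": (0, -2), "mw2s": (-2, 0)}


def move(pos, action, width, height):
    """Shift a router by the action's offset, clamped to the area."""
    dx, dy = MOVES[action]
    return (min(max(pos[0] + dx * STEP, 0.0), width),
            min(max(pos[1] + dy * STEP, 0.0), height))


def rl_placement(start, reward_fn, width, height, num_learn=300,
                 alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Return (best placement seen, its reward) after num_learn iterations."""
    rng = random.Random(seed)
    actions = list(MOVES)
    routers = list(start)
    q = {}  # (router index, action) -> value; state is implicit here
    best, best_reward = list(routers), reward_fn(routers)
    for _ in range(num_learn):
        i = rng.randrange(len(routers))  # randomly chosen router
        for a in actions:                # update Q for every action
            trial = (routers[:i] + [move(routers[i], a, width, height)]
                     + routers[i + 1:])
            old = q.get((i, a), 0.0)
            future = max(q.get((i, b), 0.0) for b in actions)
            q[(i, a)] = old + alpha * (reward_fn(trial)
                                       + gamma * future - old)
        greedy = max(actions, key=lambda a: q[(i, a)])
        if rng.random() < epsilon:       # explore a non-greedy action
            chosen = rng.choice([a for a in actions if a != greedy])
        else:                            # exploit the best-valued action
            chosen = greedy
        routers[i] = move(routers[i], chosen, width, height)
        current = reward_fn(routers)
        if current > best_reward:
            best, best_reward = list(routers), current
    return best, best_reward


def coverage(routers, clients, d_r):
    """Toy reward: fraction of clients within d_r of any router."""
    hit = sum(1 for c in clients
              if any(hypot(c[0] - r[0], c[1] - r[1]) <= d_r for r in routers))
    return hit / len(clients)


gen = random.Random(1)
clients = [(gen.uniform(0, 1000), gen.uniform(0, 1000)) for _ in range(40)]
start = [(gen.uniform(0, 1000), gen.uniform(0, 1000)) for _ in range(5)]
reward = lambda rs: coverage(rs, clients, 200.0)
best, best_reward = rl_placement(start, reward, 1000.0, 1000.0)
```

Each iteration evaluates the reward once per action, which matches the $I \times |A|$ factor in the complexity analysis; the cost of `reward_fn` supplies the remaining factor.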

Simulation scenarios
The performance of the proposed method was evaluated through simulation using Python. Our proposed method is compared with the most recent methods that use approximate optimization algorithms to address the RNP problem, including GA [40], PSO [24], WOA [41], and MVO [11]. All experiments were run on a computer with a 3.6 GHz Core i7 CPU. The surveyed network instances (NI) are presented in Table 3. NI-1 and NI-2 were used to investigate the effect of the number of mesh routers on the network performance, with the number of mesh routers ranging from 20 to 45 and covering 150 mesh clients (NI-1) and 350 mesh clients (NI-2). NI-3 and NI-4 were used to study the effect of client density, varying from 100 to 400. In NI-5 and NI-6, the effect of the coverage radius of each mesh router was thoroughly examined. The final two NIs were used to investigate the influence of the network area. The parameters of the simulation scenarios and algorithms are presented in Table 4, where the parameters of the GA, PSO, WOA, and MVO are set as in [11].

Simulation results
Topology evaluation. First, we evaluate the topology obtained when solving the RNP problem using the GA, PSO, WOA, MVO, and our proposed method, which employs RL. The scenario considered uses a coverage radius of 200 [m] for each mesh router. We can observe that the method using RL provides the best topology compared with the methods using the approximate optimization algorithms GA, WOA, PSO, and MVO. Specifically, for the method using reinforcement learning, there are 334 mesh clients covered by at least one mesh router, corresponding to a rate of 95.43%. These values were 292 (83.43%), 309 (88.29%), 313 (89.43%), and 313 (88.86%) for the WOA, GA, PSO, and MVO algorithms, respectively. In addition, the topology of the reinforcement learning method has a wider coverage area than the other methods, which can increase the percentage of clients covered in the case of denser clients.
Impact of mesh router density. In this section, the impact of the mesh router density on network performance is investigated using various simulation scenarios. We use the most important metric commonly used to evaluate the performance of RNP problem solving methods, that is, network connectivity (NC). In our context, the NC is calculated from $\alpha(r_i)$ and $\beta(c_j)$, which are determined according to (2) and (3), respectively, and from m and n, the numbers of mesh routers and mesh clients. The results in Fig 4 clearly show the difference in network connectivity between the proposed method and the methods using approximate optimization algorithms. These findings were obtained using NI-1, in which the number of mesh routers varied from 20 to 45, covering 150 mesh clients in an area of 2000 × 2000 [m$^2$] with a coverage radius of 200 [m] for each mesh router. We can observe that the NC increases with the number of mesh routers for all methods. This is expected because as the number of mesh routers increases, the coverage area expands, increasing the probability of mesh clients being covered. Comparing the methods of solving the RNP problem, the method using RL (labeled RL-based RNP in the legend) gives the highest NC. For example, in the case of 35 mesh routers, the NC values of the methods using the WOA, PSO, GA, MVO, and RL are 85.64%, 87.42%, 90.67%, 93.42%, and 95.68%, respectively. Thus, compared with the methods using the WOA, PSO, GA, and MVO algorithms, the proposed method improved the NC by 10.03, 8.25, 5.01, and 2.25 percentage points, respectively. This is a significant result for improving WMN performance. The results were quite similar for NI-2, as shown in Fig 5. The assumptions of this simulation scenario are the same as those in NI-1, except that the number of mesh clients increases to 350. We can observe that the proposed method provides high efficiency in terms of NC for most values of the number of mesh routers. The NC of the method using RL increases by an average of 4 to 20% compared with the cases where approximate optimization algorithms are used. For example, with 35 mesh routers, the NC of the RL method is 98.71%, whereas the values for the WOA, PSO, GA, and MVO algorithms were 81.66%, 86.89%, 88.59%, and 94.44%, respectively. Thus, the method using RL improved the NC by 4.26 to 17.04 percentage points.
Based on the findings in Figs 4 and 5, we can conclude that changing the number of mesh routers affects network performance in terms of NC. The larger the number of routers, the higher the NC for all investigated RNP problem solving methods. In particular, the method based on RL is the most efficient.
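The improvements quoted for NI-1 with 35 mesh routers are differences in percentage points between the NC of the RL method and that of each baseline. The short Python check below recomputes them from the rounded NC values given above; the variable names are illustrative. Differences computed this way can deviate from the reported improvements by 0.01, presumably because the reported figures were derived from unrounded values.

```python
# NC values (in %) reported for NI-1 with 35 mesh routers.
rl_nc = 95.68
baselines = {"WOA": 85.64, "PSO": 87.42, "GA": 90.67, "MVO": 93.42}

# Absolute gain of the RL method over each baseline, in percentage points.
gains = {name: round(rl_nc - nc, 2) for name, nc in baselines.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} percentage points")
```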
Impact of mesh client density. In this section, we investigate the effect of client density on network performance. In a WMN, the denser the clients, the greater the number of connection requests to the routers. As a result, network performance is affected. This can be seen in the scenario (NI-3) where the number of mesh routers is 30, covering 150 to 300 mesh clients. We can easily observe that the method using RL always yields the highest NC regardless of whether the client density is sparse or dense. The NC value of this method ranges from 90.43% to 95.79%. Meanwhile, the NC values for the WOA, PSO, GA, and MVO algorithms range from 74.59% to 84.08%, from 77.00% to 85.63%, from 82.32% to 91.46%, and from 88.27% to 90.83%, respectively. When 45 mesh routers were used (NI-4), the NC value increased for all methods. This is clearly shown in Fig 7, where we represent NC versus the number of mesh clients. Comparing the methods, we find that the method using RL outperforms the methods using approximate optimization algorithms in terms of NC.
… because this uses more mesh routers. As in the previous scenarios, the method using RL always yields the highest NC.
Impact of network area. In this last part, we investigate the effect of the network area on the efficiency of the RNP problem solving methods. The NC value decreases as the network area grows for all the algorithms. This is because, for a given number of mesh routers, the larger the network area, the lower the percentage of area covered, leading to a decrease in the NC value. However, the NC value of the method using RL is always the largest.
Based on the above findings, we can conclude that the proposed method, which uses reinforcement learning to solve the RNP problem, is more efficient than methods that use approximate optimization algorithms. This is a crucial result for the design and implementation of a WMN, as it helps find an optimal network topology that exploits network resources more efficiently.

Conclusion
The placement of router nodes in wireless mesh networks is a significant problem that has recently attracted the interest of several research groups. This problem is recognized as NP-hard and cannot be solved efficiently using conventional algorithms. In this study, we proposed a new and effective method for solving this problem using RL. The process of finding the optimal coordinates for placing mesh routers is modeled as an RL model whose main components, namely the environment, agent, action, and reward, correspond to the network system, routers, coordinate adjustment, and network connectivity of the RNP problem, respectively. Simulation results show that our proposed method outperforms the most recent methods in terms of coverage and network connectivity.
In future work, we will continue to develop this method by considering additional constraints on the quality of transmission and load balancing to improve network performance.In addition, the deep reinforcement learning method can also be applied to static and dynamic RNP problems to further improve the performance of the WMN.

Fig 1. An example of a wireless mesh network. https://doi.org/10.1371/journal.pone.0301073.g001
Considering the example of the WMN in Fig 1, we can easily observe that the set of connected clients is {$c_1$, $c_3$, $c_5$, $c_6$, $c_7$, $c_8$, $c_{10}$, $c_{11}$}. Client $c_9$ is not a connected client because it is not covered by any mesh router. Clients $c_2$ and $c_4$, although covered by router $r_5$, are not connected clients, because $r_5$ is not a connected router.

Table 1. The notations used in this paper.
Notation | Description
$P_g = \{(x_{g_i}, y_{g_i}) \mid i = 1..k\}$ | Set of the coordinates of gateway routers