Self-Healing Networks: Redundancy and Structure

We introduce the concept of self-healing in the field of complex networks modelling; in particular, self-healing capabilities are implemented through distributed communication protocols that exploit redundant links to recover the connectivity of the system. We then analyze the effect of the level of redundancy on the resilience to multiple failures; in particular, we measure the fraction of nodes still served for increasing levels of network damage. Finally, we study the effects of redundancy under different connectivity patterns (from planar grids, to small-world, up to scale-free networks) on healing performance. Small-world topologies show that introducing some long-range connections in planar grids greatly enhances the resilience to multiple failures, with performance comparable to that of the most resilient (and least realistic) scale-free structures. Obvious applications of self-healing are in the important field of infrastructural networks such as gas, power, water and oil distribution systems.


Introduction
In the field of complex networks [1,2], most studies have focused on how to improve the robustness (i.e. the capability of surviving intentional and/or random failures) of existing networks [3]. Much less has been done regarding resilience (i.e. the capability of recovering from failures). In fact, implementing smart (as well as economical) strategies aimed at maintaining a high level of performance is a crucial issue yet to be solved and represents one of the most pressing and interesting scientific challenges. A most important field of application for the results of such investigations is infrastructural networks. Infrastructural networks are the backbone of our society, which critically depends on the continuous functioning of systems like power, gas or water distribution.
As a standard, infrastructural networks have been designed to be resilient at least to the loss of a single component; on the other hand, their constantly growing size has increased the possibility of multiple failures, which often were not considered in their original design. In general, implementing the possibility of recovering from any sequence of k failures requires an exponentially growing effort in means and investments; it is therefore more viable to consider implementing systems that are able to recover from k failures on average: in this paper we will follow such a statistical approach.
In the field of communication [4][5][6] and wireless networks [7][8][9][10], self-healing algorithms have recently been the subject of massive investigation. In general, such strategies aimed at maintaining network connectivity assume the possibility of creating new communication channels among the nodes of the network, often with no constraints on the number of new connections available [11]. This is not the case in infrastructural networks, where the possibility to create new links among nodes is normally not available (at least in the short run), since links are physical (fixed in advance) and creating new ones requires both time and investments.
In general, self-healing in infrastructural networks should be thought of as a constrained mechanism in which only a limited amount of resources is available. An example of such an approach can be found in material science, where new polymeric compounds are capable of self-healing due to the presence of small amounts of healing agents that get released and activated upon cracking [12,13]. An alternative strategy to ensure the continuity of a system is to ensure redundancy in the interconnectivity of its components; for example, when a hole is punched in a leaf, the remaining vessels are capable of sustaining the extra flow necessary to keep the tissues alive [14].
Infrastructural networks are very well engineered systems characterised by fluxes of commodities (from electric power to drinking water). In this paper we consider a simplified description of such systems in terms of complex networks with a simple dynamical process describing the flow of a commodity from one or more sources (production) to several sinks (consumption). We then introduce a novel healing strategy based on the activation of fixed redundant resources (backup links) via a generic routing algorithm and study the resilience of the networks to multiple failures. The presence of such backup links is customary in technological networks; hence, our self-healing procedure is within the reach of current technology. As an example, urban low-voltage distribution power grids have an almost planar topology and are essentially radial (tree-like) networks with a few inactive backup links that can be activated (often manually) to restore power in case of failures.

Model
In our scenario the system is assumed to describe a network that distributes some utility following flow conservation analogous to Kirchhoff's current law; examples of systems following such constraints are not only power grids, but also the flows of fluids in distribution networks like water, gas, oil (at least at stationarity). As a further simplification, we will consider a single node to be the source of the commodity distributed on the network and we will not consider any constraint on the amount of flow that can be transported by any link; hence, connectivity between a sink and the source is enough to have the sink served. In our scenario, all the nodes (except the source) are considered to be potential sinks. Hence, to serve as many nodes as possible, connectivity must be maximized.
A further assumption is that at each instant of time, the topology of the network distributing the commodity is a tree (the active tree); in fact, such a structure meets the needs of infrastructure managers, i.e., to measure (for billing purposes) in an easy and precise way how much of a given quantity is served to any single node of the network. In networks where such an assumption is not strictly true (like drinking water networks), very few loops (i.e. low meshedness) are present [15].
To model active trees, we start from an underlying network topology and build up a spanning tree. The set of all possible redundant links is exactly the set of links in the network not belonging to the spanning tree. In order to allow for recovery, we also consider the presence of dormant backup links (i.e., a set of links that can be switched on), as in the case of urban low-voltage distribution power grids.
While commodities can be transported only via active links, in order to implement our self-healing strategy we assume that nodes are able to communicate by means of a suitable distributed interaction protocol only with the set of neighbouring nodes, i.e. the ones connected either via active or via dormant links. According to our procedure, when either a node or a link failure occurs, all the nodes that cannot be served (i.e. those with no path to the source) disconnect from the active tree. Afterwards, unserved nodes try to reconnect to the active tree by waking up (activating), through the protocol, some of their dormant backup links. Such a process reconstructs a new active tree that can restore the connectivity totally or partially, i.e. heals the system. A more formal description of the self-healing procedure and of the simulation protocols is provided in the Methods section.
A natural metric to quantify the success of such a procedure is the fraction of served nodes (FoS). In order to identify the system's properties that are able to maximize the FoS we study the effects of varying the fraction of backup links (redundancy) according to different underlying connectivity patterns with respect to multiple random failures.
In order to stress the peculiarities of different network structures, we generate classes of graphs with different connectivity patterns (see Methods). We start our investigation by focusing on the underlying topology which often resembles the actual situation of infrastructural networks, i.e. nodes arranged on a planar square grid (SQ). Then, we stress the role of the underlying networks' connectivity patterns by using the scale-free (SF) topology generated according to Barabasi-Albert [16] and the small-world (SW) topology generated according to Watts and Strogatz [17]. All the initial network structures are generated by using the IGRAPH library [18].
To generate the random spanning trees associated to each kind of network structures, we use the flat sampling algorithm of Wilson [19]. We take such spanning trees as the initial configuration of our model distribution networks. The links not belonging to the spanning trees form the set of the possible backup links of our system; among such links, we choose a random fraction r of dormant links that can be used to heal the system. We then simulate the occurrence of uncorrelated multiple failures by deleting at random k links of the initial active tree. Notice that link failures are the most general ones, as a node failure is equivalent to the simultaneous failure of all its links.
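The construction of an initial configuration described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the paper uses the IGRAPH library, whereas here the grid, the loop-erased random walk version of Wilson's algorithm and the sampling of dormant and failed links are written by hand; all function names, the 10 x 10 grid size, the seed and the values r = 0.1 and k = 5 are hypothetical choices.

```python
import random

def grid_edges(side):
    """Links of a side x side square grid; node (i, j) is numbered i * side + j."""
    edges = []
    for i in range(side):
        for j in range(side):
            v = i * side + j
            if j + 1 < side:
                edges.append((v, v + 1))
            if i + 1 < side:
                edges.append((v, v + side))
    return edges

def wilson_spanning_tree(n_nodes, edges, rng):
    """Uniform random spanning tree of a connected graph (loop-erased random walks)."""
    neighbours = {v: [] for v in range(n_nodes)}
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    in_tree = {rng.randrange(n_nodes)}  # arbitrary root of the tree
    tree = []
    for start in range(n_nodes):
        # Walk at random until the tree is hit; overwriting next_hop erases loops.
        next_hop = {}
        u = start
        while u not in in_tree:
            next_hop[u] = rng.choice(neighbours[u])
            u = next_hop[u]
        # Retrace the loop-erased path and attach it to the tree.
        u = start
        while u not in in_tree:
            in_tree.add(u)
            tree.append(tuple(sorted((u, next_hop[u]))))
            u = next_hop[u]
    return tree

def sample_dormant_and_failures(edges, tree, r, k, rng):
    """Choose a fraction r of the non-tree links as dormant and k tree links as failed."""
    tree_set = set(tree)
    backup = [e for e in edges if e not in tree_set]       # possible backup links
    dormant = rng.sample(backup, round(r * len(backup)))   # dormant links
    failed = rng.sample(tree, k)                           # k uncorrelated link failures
    return dormant, failed

rng = random.Random(42)          # hypothetical seed, for reproducibility only
side = 10
edges = grid_edges(side)
tree = wilson_spanning_tree(side * side, edges, rng)
dormant, failed = sample_dormant_and_failures(edges, tree, r=0.1, k=5, rng=rng)
print(len(edges), len(tree), len(dormant), len(failed))
```

The tree always has one link fewer than the number of nodes, so the pool of possible backup links is fixed by the underlying topology and only the dormant subset changes with r.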
The source node (i.e., the root of the oriented active tree) is chosen at random among all the nodes of the underlying network. The only exception is the case of the SF networks where we use, according to the preferential attachment principle, the natural choice of having the node with the highest number of neighbours (the central hub) as the source.
Our self-healing algorithm is a routing protocol (see Methods) whose goal is to reconstruct the maximal tree connected to the source after a failure has occurred; in doing so, we use both the surviving links of the active tree T and the dormant links D; fig. 1 illustrates this procedure. After the recovery, we calculate the FoS, i.e. the fraction of nodes connected to the source.
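To make the outcome of the healing step concrete, the sketch below computes the restored tree and the FoS with a centralized breadth-first search. It is a stand-in, not the distributed routing protocol of the paper: it only reproduces the end result (which nodes are served again), and the function name, the two-phase structure and the toy four-node example are assumptions made for illustration. The set of woken links it returns is valid but not claimed to be minimal.

```python
from collections import deque

def heal_and_fos(n_nodes, source, surviving_links, dormant_links):
    """Return (FoS, woken dormant links) after regrowing a tree from the source."""
    neighbours = {v: [] for v in range(n_nodes)}
    for u, v in surviving_links:
        neighbours[u].append((v, None))        # None marks an active link
        neighbours[v].append((u, None))
    for u, v in dormant_links:
        neighbours[u].append((v, (u, v)))      # dormant link, remembered by its endpoints
        neighbours[v].append((u, (u, v)))

    served = {source}
    woken = []

    # Phase 1: nodes still reachable from the source through surviving active links.
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v, dormant in neighbours[u]:
            if dormant is None and v not in served:
                served.add(v)
                queue.append(v)

    # Phase 2: unserved nodes rejoin through any usable link; dormant ones are woken up.
    queue = deque(served)
    while queue:
        u = queue.popleft()
        for v, dormant in neighbours[u]:
            if v not in served:
                served.add(v)
                if dormant is not None:
                    woken.append(dormant)
                queue.append(v)

    return len(served) / n_nodes, woken

# Toy example: the active tree is the path 0-1-2-3 with source 0 and link (1, 2) failed.
surviving = [(0, 1), (2, 3)]
fos_healed, woken = heal_and_fos(4, 0, surviving, dormant_links=[(0, 3)])
fos_unhealed, _ = heal_and_fos(4, 0, surviving, dormant_links=[])
print(fos_healed, woken, fos_unhealed)
```

Each node enters the served set through exactly one link, so the restored structure is again a tree rooted at the source.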

Effects of networks' topology
In order to test the performance of our healing algorithm in terms of the service provided after the active tree restoration, we simulate the model for an increasing number of failures. Recalling that each failure causes a cascade (i.e., each node of the sub-tree served by the broken link becomes unserved), we investigate the role of the redundancy r on different topologies.
We start our study by addressing planar square grid (SQ) networks since they are the most similar to real physical networked infrastructures. In the first scenario, we generate spanning trees on square grids; fig. 2(a) shows the variation of the restored FoS with respect to the number of failures k for different redundancies r. For square grids, we do not observe any relevant effect of the redundancy on the FoS; this means that a very small fraction of backup links (r = 0.1, i.e. 10%) already suffices to attain the maximum resilience.
The situation is completely different when the underlying topology is a scale-free network generated through the Barabasi-Albert model [16]. A widespread property of real networks is that the connectivity pattern follows a scale-free power-law distribution [1,20,21]. This feature has been found to be a consequence of the so-called preferential attachment, i.e. networks expand continuously through the addition of new vertices which attach preferentially to already well connected nodes. Although technological networks do not show power-law degree distributions due to economic and spatial constraints [22], we choose to investigate SF networks for their marked robustness to random failures [23]. For SF networks, it is natural to choose the node with the highest degree (the hub) as the source. The quality of service restored by our self-healing algorithm on SF networks is shown in fig. 2(b). As expected, we find that SF networks can easily recover connectivity to all the nodes even for low redundancies. Such error tolerance comes at the high price of being extremely vulnerable to targeted node attacks: isolating the hub disconnects the whole system. High error tolerance and targeted attack vulnerability are indeed generic properties of SF networks [24].
We then consider the case of small-world (SW) networks generated according to the Watts-Strogatz rewiring procedure [17]. In the case of technological networks, small-world networks are important since they highlight the effects of introducing long-range links in a planar topology. Starting from an initial graph (planar square grids in our case), we rewire each link with probability p to a randomly selected node; in this way we can interpolate from the case of SQ networks (p = 0) to the case of a random graph (p = 1). As in the case of simple percolation [25], the rewiring procedure introduces some long-range links (i.e., links between distant nodes on the square grid) that improve the robustness to random failures.
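The rewiring step can be illustrated with a minimal sketch. It is an assumption-laden simplification rather than the IGRAPH routine used in the paper: each link keeps one endpoint and, with probability p, moves the other to a uniformly chosen node, avoiding self-loops and duplicate links. The function names, grid size, seed and p = 0.2 are hypothetical, and this simple version does not check that the rewired graph stays connected.

```python
import random

def grid_edges(side):
    """Links of a side x side square grid; node (i, j) is numbered i * side + j."""
    edges = []
    for i in range(side):
        for j in range(side):
            v = i * side + j
            if j + 1 < side:
                edges.append((v, v + 1))
            if i + 1 < side:
                edges.append((v, v + side))
    return edges

def rewire(edges, n_nodes, p, rng):
    """Rewire each link with probability p to a uniformly chosen new endpoint."""
    current = {tuple(sorted(e)) for e in edges}
    for u, v in sorted(current):               # snapshot of the original links
        if rng.random() >= p:
            continue                           # link left untouched
        candidates = [w for w in range(n_nodes)
                      if w != u and tuple(sorted((u, w))) not in current]
        if candidates:
            w = rng.choice(candidates)
            current.remove((u, v))
            current.add(tuple(sorted((u, w))))
    return sorted(current)

rng = random.Random(1)                         # hypothetical seed
side = 10
edges = grid_edges(side)
rewired = rewire(edges, side * side, 0.2, rng)
long_range = len(set(rewired) - set(edges))    # links that are no longer grid links
print(len(edges), len(rewired), long_range)
```

The number of links is preserved, so any change in healing performance comes from where the links are placed, not from how many there are.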
In order to understand the role of the connectivity pattern we study our model on different SW networks with different rewiring probabilities. In fig. 2(c) we show the performance of our self-healing strategy with respect to an increasing number of failures. We see that a higher rewiring probability increases the number of served nodes after the restoration through the backup network; this effect shows up even if the clustering among neighbouring nodes (normally associated with a local robustness against failures) decreases; therefore long-range links increase the possibility of the network staying connected even after multiple failures.
Finally, we compare in fig. 3 the effectiveness of the self-healing protocol across different strategies. Notice that while distribution grids based on the SF topology are the most robust, they should be disregarded when considering the case of technological networks, since economic and geometric constraints make SF networks infeasible on planar topologies.

Discussion
In this paper we have introduced a minimal procedure of self-healing in networks. Such a procedure exploits the presence of redundant edges to recover the connectivity of the system. Our scenario is inspired by real-world distribution networks that are, often for economic reasons, almost tree-like and at the same time are provided with alternative backup links that can be activated in case of malfunctioning. An example of such networks is the case of urban low-voltage distribution networks [26].
Our strategy could be readily and easily implemented with current technologies. In fact, routing protocols represent a vast available source of distributed algorithms able to maintain the connectivity of a system; hence, our scheme could be implemented by the standard procedure of coupling an ICT network to a preexisting infrastructure. Our strategy is an example in which interdependencies between two networks enhance the resilience instead of introducing catastrophic breakdowns [27].
By studying the performance of our procedure as a function of the redundancy on different underlying network topologies, we have shown that distribution networks akin to real-world ones (i.e., based on planar lattices) are the least resilient to random failures. In fact, the most robust networks are, as expected, based on the SF topology; however, such a topology is unrealistic for technological networks. Our results on SW topologies hint that a very effective strategy to strengthen realistic networks is to add long-range links. The feasibility of such a strategy would depend on a cost-benefit analysis of the implementation of these physical long-range links. A further direction of study would be to consider the effects of more detailed structural characteristics of the underlying network topologies [28] or even to consider biologically inspired designs, like dynamic networks inspired by the human brain [29]. While our minimal model considers only the connectivity of the system, it can be easily expanded to take account of the magnitude of the flows: in fact, routing algorithms can both account for the capacity of the links and dynamically re-route the flows.
Our model also easily allows for cold starts, i.e., for situations in which the network has shut down due to some major event (like a black-out) [30]. This is an important issue, as one of the most time (and money) consuming activities after a major event is restoring the functionality of the network.
In this paper, we have considered only the single-source case. The next step is to consider a network served by multiple sources. In fact, the possibility of separating the system into trees would solve the "who is serving whom" problem that appears as soon as several competitors share the same physical line in bringing power to their customers [31]. Moreover, the possibility for the system of dynamically separating into time-varying trees would allow for introducing a commodity market based on real-time economic competition among the owners of the sources. This further goal is not yet within the reach of current routing protocols and should be further investigated if we want to have grids that are smart not only for their ability to self-repair but also in optimizing consumption and prices. Finally, we believe that studying and designing self-healing mechanisms in complex networks is a promising field of investigation where the dynamics of the systems should also be taken into account [32,33].

Materials and Methods
The Self-Healing procedure
We consider an abstract model of a physical networked infrastructure described by the quadruple $N = (V, v_S, E_A, E_D)$. Here $V$ is the set of nodes of the network, $v_S \in V$ is the source node, $E_A$ is the set of active links among the nodes and $E_D$ denotes the set of dormant links that can be activated in order to heal system failures by re-connecting nodes. A node is considered to be served if it is connected to a source through a path of active links; all the nodes in $V$ are initially connected to the source via a spanning tree. As the basic metric for any quality of service assessment, we consider the fraction of served nodes FoS, counting the number of nodes in the active graph, i.e. connected to $v_S$. More formally, in the initial configuration, the graph $T = (V, E_A)$ is an instance of the set $R_T(G)$ of all the random spanning trees of the underlying graph $G = (V, E)$. Thus, before the failures $T$ has $|V|$ (active) nodes and $|E_A| = |V| - 1$ links among them. The set $E_D$ of backup edges is taken from the remaining edges of the underlying graph $G$, i.e. $E_A \cup E_D \subseteq E$ and $E_A \cap E_D = \emptyset$. The fraction $r = |E_D| / [|E| - (|V| - 1)]$ measures the redundancy of $N$.
We then consider the occurrence of multiple link failures. A k-failure is a subset $E_F \subseteq E_A$ of k links chosen at random. The system right after a failure is described by the forest $T_{fail} = (V, E_A \setminus E_F)$ and by the set $E_D$ of dormant links available for the healing. A healing protocol is any algorithm that, by activating (waking up) a subset $E_W \subseteq E_D$ of dormant edges, finds a maximal tree $T'$ of $G' = (V, E_D \cup (E_A \setminus E_F))$ containing the source $v_S$. If $T'$ is spanning, then the system has fully recovered.

Figure 2 (partial caption): [...] [16]. The average fraction $\langle FoS \rangle$ of served nodes is plotted against the number of failures k. Even for a low 10% redundancy ($r = 0.1$), the system can almost totally heal after sustaining $k \sim 4 \times 10^2$ failures; as a comparison, for the same number of failures square grids lose $\sim 90\%$ of the nodes. Panel (c): distribution grids based on small-world networks obtained by rewiring a fraction $p = 0.2$ of links according to Watts-Strogatz [17]. The average fraction $\langle FoS \rangle$ of served nodes is plotted against the number of failures k. Unlike square grids and scale-free networks, the restored fraction of service FoS shows a marked dependency upon the redundancy parameter r. Similar results are obtained for $p = 0.1$ and $p = 0.3$. doi:10.1371/journal.pone.0087986.g002

For the robustness of the algorithm, we will assume that nodes have only a local knowledge of the network, i.e., only about the state (active, dormant or failed) of their incoming links. To build the maximal connected tree, nodes communicate with their neighbors via a suitable distributed protocol allowing disconnected nodes to join the active network by activating dormant edges. In other words, nodes are endowed only with the minimal requirements of routing needed to reconstruct a spanning tree [34]. In this paper, we have applied the following simple distributed algorithm to implement self-healing:

By definition, the nodes in $N'$ are the set of served nodes $V'$.
Notice that the state $N'$ still describes a network infrastructure; therefore, we can in general describe the state of the system at time $t$ by the quadruple $N(t)$ and the sequence of failures between time $t$ and $t+1$ by $E_F(t)$. A general representation of such a process can be given in terms of time-varying graphs [35].

Figure 4 (partial caption): [...] [17] and scale-free network (SF) generated according to Barabasi-Albert [16]. Lower panels: random spanning trees associated with the related underlying topologies in the upper panels. doi:10.1371/journal.pone.0087986.g004

Simulations
We analyse the response of the system to k-failures. To study the effects of different topologies, we perform simulations on different grids (fig. 4, upper panel).
Classically, engineered systems (especially in the power industry) are built to be $N-1$ robust, i.e. they can survive the failure of any single one of their components. While checking $N-1$ robustness corresponds to checking all the $N$ possible single failures, checking $N-k$ robustness requires considering $N!/(N-k)! \sim N^k$ cases. Therefore, checking $N-k$ robustness is infeasible even for modest values of k due to the combinatorial explosion of the number of possible cases. Thus, we choose to assess on probabilistic grounds whether a system would be able to sustain k failures by a Monte Carlo investigation of the space of possible failures.
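The combinatorial explosion mentioned above is easy to quantify. The short sketch below counts the ordered failure sequences $N!/(N-k)!$ and, for comparison, the unordered failure sets; the system size of $10^4$ components and the helper names are hypothetical choices made for illustration.

```python
import math

def ordered_failure_sequences(n, k):
    """Ordered sequences of k distinct failures among n components: n!/(n-k)!."""
    return math.perm(n, k)

def unordered_failure_sets(n, k):
    """Sets of k simultaneous failures among n components: n!/(k!(n-k)!)."""
    return math.comb(n, k)

n = 10**4  # hypothetical number of components
for k in range(1, 5):
    exact = ordered_failure_sequences(n, k)
    # The last column shows how close the exact count is to the estimate n**k.
    print(k, exact, unordered_failure_sets(n, k), exact / n**k)
```

Even when the order of the failures is ignored, the number of cases grows roughly by a factor of N/k with each additional failure, which is why exhaustive checking is replaced by sampling.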
Service operators are interested in maintaining their service level agreements (contracts) with their customers; to this aim, customers must in the first place remain connected to the services. Therefore, we calculate the average fraction of served customers FoS after the occurrence and the healing of k random failures. To do so, we choose at random k different links on the service tree and delete them; after that, we apply the self-healing procedure; finally, we calculate the FoS as the fraction of nodes connected to the source. We average such a procedure over several network realizations until the relative error of the average FoS is small enough (less than 5%). As an example, for a grid of $10^4$ nodes, we must typically average over 100 sets of random failures to attain the desired accuracy. Moreover, to average out the different characteristics of the initial configurations, we repeat the procedure over 100 different independently generated initial configurations.
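The averaging loop can be sketched as follows. Two assumptions are made here and are not taken from the paper: the relative error is interpreted as the standard error of the mean divided by the mean, and the network model is replaced by a toy case (a path-shaped active tree with no dormant links, where the nodes served are those before the first failed link). All names and parameter values are hypothetical.

```python
import math
import random

def mean_with_relative_error(sample, rel_tol=0.05, min_samples=10, max_samples=100_000):
    """Average sample() until the standard error of the mean drops below rel_tol * mean."""
    total = total_sq = 0.0
    n = 0
    while n < max_samples:
        x = sample()
        n += 1
        total += x
        total_sq += x * x
        if n >= min_samples:
            mean = total / n
            variance = max(total_sq / n - mean * mean, 0.0) * n / (n - 1)
            std_err = math.sqrt(variance / n)
            if mean > 0 and std_err / mean < rel_tol:
                break
    return total / n, n

def fos_on_path(n_nodes, k, rng):
    """Toy FoS: path 0-1-...-(n-1) rooted at node 0, k random link failures, no healing."""
    failed = rng.sample(range(n_nodes - 1), k)   # link i joins nodes i and i + 1
    return (min(failed) + 1) / n_nodes           # nodes 0..min(failed) stay served

rng = random.Random(0)                           # hypothetical seed
mean_fos, n_samples = mean_with_relative_error(lambda: fos_on_path(1000, 5, rng))
print(mean_fos, n_samples)
```

In a full simulation the toy sampler would be replaced by one realization of failures followed by healing, and the whole estimate would be repeated over independently generated initial configurations.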
To generate a random spanning tree $T$ associated with a graph $G$ (fig. 4, lower panel), we apply the exact algorithm of Wilson [19], which samples uniformly the elements of $R_T(G)$. Such spanning trees are taken as the initial configurations for our model distribution networks. The links of the graph $G$ that do not belong to the initial configuration $T$ form the set $E_B$ of the possible backup links of our system; of such links, only a subset $E_D$ (the dormant links) can be used to heal the system. The fraction $r = |E_D|/|E_B|$ of such dormant links characterizes the redundancy of the system: for $r = 0$ there are no links in $E_D$ and any failure splits the tree, while for $r = 1$ any of the links of $G$ can be used to recompose the system.
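As a worked illustration of these counts, consider an $L \times L$ square grid. The open (non-periodic) boundaries assumed here are not stated in the text, and $L$ is a symbol introduced only for this example.

```latex
\begin{align*}
|V| &= L^2, \qquad |E| = 2L(L-1), \\
|E_B| &= |E| - (|V| - 1) = 2L(L-1) - (L^2 - 1) = (L-1)^2, \\
|E_D| &= r\,|E_B| = r\,(L-1)^2 .
\end{align*}
```

For $L = 100$, i.e. a grid of $10^4$ nodes, this gives $|E| = 19800$ links, a spanning tree of $9999$ links and $|E_B| = 9801$ possible backup links, so a redundancy of $r = 0.1$ corresponds to about $980$ dormant links.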
Notice that in our case it would be more correct to speak about $N-k$ resilience, since we do not consider whether the system is robust to k failures (i.e. whether it is still functioning after k failures), but whether it can recover from k failures.