A Modeling Framework for System Restoration from Cascading Failures

System restoration from cascading failures is an integral part of the overall defense against catastrophic breakdown in networked critical infrastructures. From the outbreak of cascading failures to complete system breakdown, actions can be taken to prevent failures from propagating through the entire network. While most analysis efforts address the periods before or after cascading failures, restoration during cascading failures has rarely been studied. In this paper, we present a modeling framework to investigate the effects of in-process restoration, which depend strongly on the timing and strength of the restoration actions. Furthermore, the model also considers additional disturbances to the system caused by the restoration actions themselves. We demonstrate that the effect of restoration is also influenced by the combination of system loading level and restoration disturbance. Our modeling framework will help to provide insights on practical restoration from cascading failures and guide improvements of reliability and resilience of actual network systems.


Introduction
Cascading failure is a common mechanism of large-scale failures in complex network systems, such as electric power transmission grids, water/gas delivery systems and railways [1][2][3][4][5][6][7]. A practical example is the large-scale blackouts of electric power transmission systems resulting from cascading failures initiated by component overloads [6,8]. Occurrences of cascading failures are found to be statistically more significant than expected by theory [6,9]. Given the vital societal importance of these critical infrastructures, there is strong interest in the design, implementation and evaluation of effective restoration strategies against cascading failures, which rescue systems from the brink of collapse and avoid the amplification of their consequences [10][11][12][13].
Efforts have been made to study how to reduce the frequency, duration, intensity and extent of cascading failures. Many design measures aim to avoid cascading failures, such as robust structures [14][15][16][17][18][19], capacity and structural redundancy design [19][20][21][22] and the n-1 criterion [23], although cascading failures can hardly be eliminated by these measures [24]. After a cascade, black-start [25,26], system reconfiguration [27,28] and corrective restoration [29,30] are used to bring the system back to its normal operating conditions. While complete prevention of cascading failures at the design stage proves impossible and post-cascade actions only passively recover systems at a large cost, active in-process restoration can mitigate a cascading failure during its evolution, leading the system to a stable state. The primary objective of restoration during the process of cascading failures is to prevent failures from unfolding into catastrophic failures and eventually to minimize the damage, e.g., minimizing the unserved loads in an electric power transmission grid. For example, references [31,32] propose three different strategies based on line switching to minimize the consequences of cascading failures on the entire system, on predetermined areas of the system, or on both within a multi-objective optimization framework. References [33][34][35][36] introduce and analyze some restoration planning and restoration actions. With the development of fast recovery technology [37], it is possible to mitigate cascading failures and rescue the system through real-time restoration of network components. Going back to the example of the electric power transmission grid, restoration against cascading failures may be achieved in practice through real-time controlled islanding [38,39], selective load shedding [38,39], wide area monitoring [40], real-time fault analysis and validation of relay operations [41], etc.
In this paper, we present a novel modeling framework for analyzing restoration in network systems subject to cascading failures. The framework is used to study the effects of different restoration strategies in terms of restoration timing and strength: $t_r$, the restoration timing in the process of the cascading failure, and $p_r$, the restoration strength, which is quantified by the probability of repairing a failed component. Repair here means full, immediate recovery, which can be realized in practice by utilizing fast recovery technology. We study how different restoration strategies, described in terms of the two basic quantities $t_r$ and $p_r$, influence the overall system reliability.

Description of the Restoration Model
We first consider an unweighted and fully connected network of $N$ identical components [42]. The loading-dependent model proposed in [43] is adopted to describe the dynamics of cascading failures. The model is analytically tractable and captures some essential features of the cascading failure process, which helps to understand the mechanism of failure propagation in the network system. The model describes a network composed of identical components with initial loads distributed uniformly in $[L_{\min}, L_{\max}]$, so that the average initial component loading is $L = (L_{\min} + L_{\max})/2$. An initial disturbance $D$ is added to all components and may cause some components to exceed their capacity threshold $L_{\text{fail}} = 1$, which is assumed identical for all components. If component $j$ is working and $L_j + D > L_{\text{fail}}$, component $j$ fails. Each failure of a component then adds a load $P > 0$ to all the other functional components in the network, which may cause further failures in a cascade.
The restoration actions are considered once the cascading failure process has been triggered. A typical in-process restoration procedure comprises three stages [33,34]: first, estimating system/component status, locating the critical loads, and developing the strategies for rebuilding the network connections; second, identifying the paths of restoration, energizing and interconnecting subsystems; third, restoring most of the lost loads. Restoration strategies differ from each other in the above aspects. Here we propose a restoration model considering the timing and strength of restoration, which mainly determine the effects of restoration. In the model, each failed component is repaired with a certain probability $p_r$ at a given step $t_r > 0$ during the cascading failure. The restoration actions recover the links of the component to be repaired, while its links to failed components remain disconnected. We assume that restoration may cause some disturbance to the existing functional components in the network. We model this restoration disturbance by adding a random perturbation $D_r$, distributed uniformly in $[D_r^{\min}, D_r^{\max}]$, to the load of each functional component. The value of the restoration disturbance, which can be positive or negative, depends on whether the restoration action is implemented appropriately for the system. This means that the restoration may either reduce or increase the loads of the functional components, depending on whether it is beneficial or harmful.
The following algorithm is used to realize the above procedure:
1. All $N$ components are initially functional and loaded by quantities $L_1, L_2, \ldots, L_N$, which are independent random variables uniformly distributed in $[L_{\min}, L_{\max}]$. Initialize the stage counter $t$ to zero.
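The model described in this section can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. The function name `simulate_cascade`, its parameter names, the ordering of operations within a step, the choice to draw a single disturbance value $D_r$ per restoration event, and the choice to give each repaired component a fresh load drawn from $[L_{\min}, L_{\max}]$ are all assumptions made here.

```python
import random


def simulate_cascade(n, l_min, l_max, d, p, t_r=None, p_r=0.0,
                     d_r_min=0.0, d_r_max=0.0, l_fail=1.0, seed=None):
    """One realization of the loading-dependent cascade with optional restoration.

    Returns (number of components failed at the end, system load per step).
    """
    rng = random.Random(seed)
    # Initial loads, uniform in [l_min, l_max]; all components functional.
    load = [rng.uniform(l_min, l_max) for _ in range(n)]
    failed = [False] * n
    system_load = [sum(load)]            # SL(t=0), before any disturbance
    load = [x + d for x in load]         # initial disturbance D on every component
    t = 0
    while True:
        newly_failed = [j for j in range(n)
                        if not failed[j] and load[j] > l_fail]
        if not newly_failed:
            break                        # cascade has stopped
        t += 1
        for j in newly_failed:
            failed[j] = True
        # Each new failure adds P to every component that is still functional.
        extra = p * len(newly_failed)
        for j in range(n):
            if not failed[j]:
                load[j] += extra
        if t == t_r and p_r > 0:
            # Restoration at step t_r: repair each failed component with prob. p_r.
            repaired = [j for j in range(n)
                        if failed[j] and rng.random() < p_r]
            # Assumption: one common disturbance value per restoration event,
            # applied to the components that were functional before the repair.
            d_r = rng.uniform(d_r_min, d_r_max)
            for j in range(n):
                if not failed[j]:
                    load[j] += d_r
            # Assumption: repaired components restart with a fresh initial load.
            for j in repaired:
                failed[j] = False
                load[j] = rng.uniform(l_min, l_max)
        # Total load carried by functional components at this step.
        system_load.append(sum(x for x, f in zip(load, failed) if not f))
    return sum(failed), system_load
```

Averaging the first return value over many seeds gives an estimate of the mean number of failed components for a chosen pair $(t_r, p_r)$; the second return value is the per-step system load from which load fluctuations can be computed.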

Effects of Restoration Strategies
In this section, we study the effects of different restoration strategies on the system robustness against cascading failures and the resulting system reliability. We begin by evaluating the restoration effect on the total damage caused by cascading failure (the average avalanche size). Figure 1 compares different restoration strategies in terms of restoration timing $t_r$ and strength $p_r$ by measuring the mean number of failed components $ES$. As shown in [44], there is a transition of $ES$ occurring at the critical point $L_c = 0.8$ without restoration ($p_r = 0$). When the value of $L$ is below this threshold, few failures emerge. On the other hand, for $L$ above the threshold, there is a significant risk of cascading failures that lead to global collapse of the system. In-process restoration can reduce the final damage significantly if it is implemented properly. As shown in Fig. 1, cascading failure under restoration ($p_r > 0$) with negative $D_r$ generates a much smaller avalanche size $ES$ than the case without restoration ($p_r = 0$). Furthermore, for negative $D_r$, early restoration (e.g., $t_r = 1$) ends up with more functional components than late restoration (e.g., $t_r = 4$). For positive $D_r$, restoration worsens the system in terms of $ES$.
To investigate the effects of different restoration strategies on improving the reliability against cascading failures, we measure the system load fluctuations (SLF), i.e., the accumulated deviation of the total system load from its initial value over the whole cascade:
$$SLF = \sum_{i=1}^{T} \left| SL(t=i) - SL(t=0) \right|, \qquad SL(t=i) = \sum_{j=1}^{N} L_j(t=i),$$
where $N$ is the total number of components in the network system, $L_j$ is the load of component $j$, and we set $L_j(t=i) = 0$ if component $j$ is failed at moment $i$. $SL(t=0)$ is the initial system load when the system maintains its normal functional state. $SL(t=i)$ is the total load of the system at moment $i$ in the cascading process, i.e., the sum of the loads of all functional components. The parameter $T$ is the duration of the whole dynamical process of cascading failures. The measure SLF reflects the system instability over the whole process of cascading failures, considering the required balance between supply and demand. Figure 2 shows SLF under restoration at a given restoration timing $t_r$ as a function of the restoration probability $p_r$. From Figs. 2a-2d, we can see that restoration with negative disturbance can effectively mitigate cascading failures and reduce system instability. Furthermore, the system can be improved by a high strength of restoration.
Each curve corresponds to the average over twenty thousand realizations of networks with $10^5$ components. The example network system has no specific topology, and the results do not depend on it. The initial component loading can vary from $L_{\min}$ to $L_{\max} = L_{\text{fail}} = 1$. Then $L = (L_{\min} + 1)/2$ may be increased by increasing $L_{\min}$. The initial disturbance $D = 4 \times 10^{-6}$ is assumed to be the same as the load transfer amount $P = 4 \times 10^{-6}$. All the investigated network systems without restoration satisfy the cascading condition that the cascade step is no less than 5.
The situation for positive restoration disturbance is more surprising. One may expect that restoration would worsen the cascading failure when $D_r$ is positive.
The corresponding results with positive $D_r$ are complicated: restoration can still improve the system for subcritical loading (Fig. 2e); at the critical loading $L_c$, restoration produces quite large SLF and induces extra instability (Fig. 2f); for supercritical loading, restoration has almost no impact on SLF (e.g., $L = 0.9$, Fig. 2g).
The results above can be explained as follows. The restoration effect is dominated by two factors: the restored components and the consequential restoration disturbance. Under negative $D_r$ these two factors cooperate, since failed components are recovered while the load of functional components is decreased. This cooperative effect can be stronger for early restoration. When $D_r$ is positive, however, restoration increases the load of functional components at the same time as failed components are restored. The outcome of restoration then depends on the competition between these two factors.
To further explore the effect of restoration disturbance, in Fig. 3 we analyze the restoration effect as a function of the restoration disturbance. For $L = 0.6$ and $0.8$, SLF under restoration ($p_r = 1$) increases significantly as the restoration disturbance increases (Figs. 3a and 3b). For supercritical loading, SLF increases for negative $D_r$ and then remains saturated for positive $D_r$ ($L = 0.9$, Fig. 3c), while early restoration ($t_r = 1$) can improve the system for both negative and positive $D_r$ ($L = 0.95$, Fig. 3d). Similar to the results in Fig. 2, restoration under negative $D_r$ at an early cascade step is beneficial for all investigated cases. When the restoration disturbance $D_r$ is positive, restoration improves the system only for certain values of system loading.
To observe the dynamical processes of restoration, we track the system evolution under restoration in terms of system fluctuations during cascading failure. The load fluctuation of the system at moment $t$ is defined as $LF(t=i) = |SL(t=i) - SL(t=0)|$. For convenience, we assume that $LF(t) = 0$ when $t > T$. Figure 4 demonstrates the system evolution process in terms of the load fluctuation $LF(t)$, where the total system load fluctuation is the corresponding area under the curve of $LF(t)$. Early restoration ($t_r = 1$) under negative $D_r$ is shown to reduce the load fluctuation from the restoration moment onward in the process of cascading failures (Figs. 4a-4d). However, for positive $D_r$, the load fluctuation under restoration at $L = 0.6$ is lower than that without restoration (Fig. 4e), while for $L = 0.8$ the load fluctuation is significantly increased (Fig. 4f). Late restoration with positive $D_r$ does not help a highly loaded system (Figs. 4g and 4h).
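The two fluctuation measures can be computed from a recorded sequence of system loads as in the short sketch below. The function names are hypothetical, and taking the area under $LF(t)$ as a plain sum over unit time steps, with no normalization, is an assumption made for this illustration.

```python
def load_fluctuation(system_load):
    """LF(t=i) = |SL(t=i) - SL(t=0)| for each recorded step i = 0..T."""
    sl0 = system_load[0]
    return [abs(sl - sl0) for sl in system_load]


def system_load_fluctuation(system_load):
    """Area under LF(t), taken here as a sum over unit time steps (assumption)."""
    return sum(load_fluctuation(system_load))
```

Here `system_load` is a list whose first entry is $SL(t=0)$ and whose remaining entries are the total loads of the functional components at steps $1, \ldots, T$.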

Analytical Methods
According to the proposed restoration model, $n$ components are loaded in $[L_{\min}, L_{\max}]$. We set $L = (L_{\min} + L_{\max})/2$ and $L_{\max} = L_{\text{fail}} = 1$. Then component $j$ has the load $L_j \in [2L-1, 1]$ and fails when its load is larger than $L_{\text{fail}}$. An initial disturbance $D$ is added to each component. Each failed component transfers a fixed amount of load $P$ to the other functional components.
Based on the literature [44], the distribution of the total number of failed components $S$ without restoration can be given by
$$P(S=r) = \begin{cases} \binom{n}{r}\, d\,(rp+d)^{r-1}(1-rp-d)^{n-r}, & r = 0, 1, \ldots, \left[\frac{1-d}{p}\right], \\ 0, & \left[\frac{1-d}{p}\right] < r < n, \\ 1 - \sum_{s=0}^{n-1} P(S=s), & r = n, \end{cases}$$
where $[x]$ is the largest integer not more than $x$ and $0 \le d = \frac{D}{2-2L} \le 1$, $p = \frac{P}{2-2L} > 0$, $\left[\frac{1-d}{p}\right] < n$. When $rp + d \le 1$, $n \to \infty$, $p \to 0$, $d \to 0$, with $\theta = nd$ and $\lambda = np$, the above distribution can be approximated by a branching process with
$$P(S=r) \approx \frac{\theta\,(r\lambda+\theta)^{r-1}\, e^{-r\lambda-\theta}}{r!}. \quad (5)$$
Then we have the approximation [45] based on the property of this branching process, where $s_i = m_1 + \cdots + m_i$. As our investigated configurations satisfy the cascading condition that the cascade step is no less than 5 (or any arbitrary number), we obtain $P(T \ge 5) = P(M_5 \neq 0) = 1 - P(M_5 = 0) = p(\theta, \lambda)$. Then the distribution of the total number of failed components $S$ without restoration ($p_r = 0$) is given by Eq. (8), conditioned on $M_5 \neq 0$. According to the parameters in the text, we set $\lambda = np = \frac{nP}{2-2L_c} = 1$ and get the critical loading $L_c = 0.8$, which corresponds to the case in Fig. 1.
When the restoration strategy $(t_r, p_r)$ with $t_r > 0$, $p_r > 0$ is taken, we assume that the total number of components failed at the restoration timing is $S_{t_r} = s_{t_r} < n$. Then the current state of the system is as follows: $m$ failed components; $s_{t_r} - m$ restored components loaded in $[2L-1, 1]$; and $(n - s_{t_r})$ functional components loaded in $[2L-1+D+s_{t_r}P+D_r,\ 1+M_{t_r}P+D_r]$. The failure rate is $\frac{M_{t_r}P + D_r}{2-2L-D-s_{t_r-1}P}$, and $Em = (1-p_r)\,s_{t_r}$. The system may go on evolving after the restoration. The average avalanche size $ES$ therefore depends strongly on the value and sign of the restoration disturbance $D_r$, the restoration timing $t_r$, the restoration strength $p_r$ and the system loading level $L$.
When $M_{t_r}P + D_r \le 0$, restoration ends the cascading failure. Then the distribution of the total number of failed components $S$ with restoration is $P(S = s_r \mid M_5 \neq 0) = \sum \cdots$. When $M_{t_r}P + D_r > 0$, the state of the system at $t_r$ can be replaced by $m$ failed components and $(n-m)$ functional components loaded in $[2L-1, 1]$ and disturbed by the load $D' = \frac{(D + s_{t_r}P + M_{t_r}P + 2D_r)(n - s_{t_r})}{2(n-m)}$. Considering the cascading condition that the cascade step is no less than 5 (or any arbitrary number) and the restoration timing $t_r$, the distribution of the total number of failed components $S$ with restoration is given by Eq. (10). The distribution of the total number of components failed after restoration, $S_r$, satisfies $P_1(S_r = r) = 1 - \sum_{s=0}^{n'-1} P_1(S_r = s)$ for $r = n'$, where $n' = n - m$, $d' = \frac{D'}{2-2L}$ and $p = \frac{P}{2-2L}$.
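As a numerical companion to the branching-process approximation and the critical-loading condition $\lambda = nP/(2-2L_c) = 1$, the sketch below evaluates both. The function names are hypothetical, and evaluating the probability in log space with `math.lgamma` is a numerical choice made here for stability, not something prescribed by the text.

```python
import math


def critical_loading(n, p_transfer):
    """Solve lambda = n*P / (2 - 2*L_c) = 1 for the critical loading L_c."""
    return 1.0 - n * p_transfer / 2.0


def branching_pmf(r, theta, lam):
    """P(S=r) ~ theta * (r*lam + theta)**(r-1) * exp(-r*lam - theta) / r!

    Computed in log space so that large r does not overflow; requires theta > 0.
    """
    log_p = (math.log(theta)
             + (r - 1) * math.log(r * lam + theta)
             - (r * lam + theta)
             - math.lgamma(r + 1))      # lgamma(r + 1) = log(r!)
    return math.exp(log_p)
```

With the parameters quoted in the text ($n = 10^5$, $P = 4 \times 10^{-6}$), `critical_loading` returns a value equal to $0.8$ up to floating-point rounding. For $\lambda < 1$ the probabilities returned by `branching_pmf` sum to one over $r = 0, 1, 2, \ldots$, with mean $\theta/(1-\lambda)$.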
Next we give the analytical results for the proposed modeling framework of restoration. First, we compare simulation and theory for the case $p_r = 0$ in Fig. 5, which corresponds to Eq. (8). As shown in Fig. 5, the theoretical calculation agrees well with the numerical simulations. The distribution behaves as a power law at the critical loading, at which the system has a high probability of large-scale failures.
Then, we compare simulation and theory under restoration ($p_r \neq 0$) in Fig. 6 for negative $D_r$ and in Fig. 7 for positive $D_r$. These cases correspond to Eq. (10). As shown in Figs. 6 and 7, the theoretical calculation agrees well with the numerical simulations.

Model Variations
We apply our modeling framework of restoration to the western U.S. power transmission grid [46] for model validation. Here we present the results in Fig. 8 and Fig. 9 on the realistic power system with more practical considerations in the model:
Variation 1: initial load distribution. We change the distribution of initial component loading from a uniform distribution to a Gaussian distribution.
Variation 2: impact of each failed component on the functional components. Previously, each failure of a component led to an additional load $P > 0$ added to all the other functional components in the network regardless of network topology. Now each failed component leads to an additional load $Q > 0$ added only to its functional neighbors, which depends on network topology.
Variation 3: restoration disturbance $D_r$. We change the distribution of the restoration disturbance from a uniform distribution to a Gaussian distribution.
Figure 8 compares different restoration strategies in terms of restoration timing $t_r$ and strength $p_r$ by measuring $ES$. There is a transition of $ES$ occurring around the critical point $L_c = 0.9$ without restoration ($p_r = 0$). As shown in Fig. 8, cascading failure under restoration ($p_r > 0$) with negative $D_r$ generates smaller $ES$ than the case without restoration ($p_r = 0$). For negative $D_r$, early restoration (e.g., $t_r = 1$) ends up with more functional components than late restoration (e.g., $t_r = 4$). For positive $D_r$, restoration worsens the system in terms of $ES$. The results are similar to Fig. 1. Figure 9 further explores the effect of restoration disturbance in terms of the system load fluctuations (SLF). We can see that the effects of restoration are heavily influenced by the restoration strategies. For subcritical loading ($L = 0.8$), SLF increases for negative $D_r$ and stays almost constant for positive $D_r$, while restoration worsens the system for each $D_r$ (Fig. 9b). For supercritical loading ($L = 0.95$), SLF increases for negative $D_r$ and decreases for positive $D_r$, while early restoration ($t_r = 1$) improves the system for each $D_r$ (Fig. 9d). For a given $D_r$, restoration can improve the system only for certain values of system loading.
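The topology-dependent load transfer of Variation 2 can be sketched as follows. The function name, the adjacency-list representation `neighbors`, and the in-place update are assumptions made for illustration; the text does not specify how the grid topology is stored or processed.

```python
def transfer_to_neighbors(load, failed, neighbors, newly_failed, q):
    """Add load q to every functional neighbor of each newly failed component.

    load:         list of component loads (updated in place)
    failed:       list of booleans, already True for the newly failed components
    neighbors:    dict mapping a component index to a list of adjacent indices
    newly_failed: indices of components that failed at the current step
    """
    for j in newly_failed:
        for k in neighbors[j]:
            if not failed[k]:
                load[k] += q     # only functional neighbors receive the load
    return load
```

On a four-node path 0-1-2-3 in which component 1 has just failed, only components 0 and 2 receive the extra load $Q$, whereas the original model would add load to all three surviving components.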

Conclusions
Proper restoration during cascading failures can actively prevent failure propagation through the entire network. We have proposed a novel modeling framework to investigate the restoration effect during cascading failures with respect to restoration timing $t_r$ and strength $p_r$. The model also considers additional disturbances on the system due to the restoration actions themselves. The effects of the restoration have been analyzed with respect to the mean number of failed components $ES$ and the system load fluctuations SLF. $ES$ focuses on the final state of the cascade-restoration process, whereas the newly introduced measure SLF describes the dynamical behavior of the system. By applying the proposed modeling framework to the example system, we find that the restoration effects also depend on the combination of system loading level $L$ and restoration disturbance $D_r$. Although the system can be improved by proper in-process restoration, restoration should be applied carefully, taking the system loading level into account. Our framework and findings can help to evaluate restoration schemes for complex systems and provide insights into the development of optimal restoration strategies against cascading failures, which are helpful for guiding improvements of reliability and robustness of actual network systems. Given the rapid development of Micro-Grid technology, it is interesting and necessary to study restoration for Micro-Grids against cascading failures. Although we have no Micro-Grid data for now, we will perform the relevant study in the future based on the framework provided in this paper. Based on this framework, more realistic scenarios considering the real-time status of the system can also be studied in the near future.