A Scalable Distribution Network Risk Evaluation Framework via Symbolic Dynamics

Background: Evaluations of electric power distribution network risks must address the problems of incomplete information and changing dynamics. A risk evaluation framework should be adaptable to a specific situation and an evolving understanding of risk.

Methods: This study investigates the use of symbolic dynamics to abstract raw data. After introducing symbolic dynamics operators, Kolmogorov-Sinai entropy and Kullback-Leibler relative entropy are used to quantitatively evaluate relationships between risk sub-factors and main factors. For layered risk indicators, where the factors are categorized into four main factors (device, structure, load and special operation), a merging algorithm using operators to calculate the risk factors is discussed. Finally, an example from the Sanya Power Company is given to demonstrate the feasibility of the proposed method.

Conclusion: Distribution networks are exposed and can be affected by many things. The topology and the operating mode of a distribution network are dynamic, so the faults and their consequences are probabilistic.


Introduction
Electric power distribution networks are receiving greater attention from both administrators and end users in China as the construction of new rural networks and smart grids proceeds. A risk evaluation framework should adapt to the management strategy, the evolution of technology and the evolving understanding of risk for the distribution network.
To describe risk in a distribution network, data of various types such as continuous load data or discrete user-level data should be collected. Current approaches tend to use fuzzy set theory to abstract or categorize data, but the coarse nature of fuzzy sets precludes further processing at a finer granularity.
Quantitatively, risk is defined as

$$\mathrm{Risk} = \mathrm{Possibility} \times \mathrm{Loss} \qquad (1)$$

where Possibility is the likelihood of the occurrence of a particular fault and Loss is the consequence of that fault occurring. Currently, consequences in distribution network risk analysis are mostly measured by power loss, which is inadequate. For example, the same power loss would have different economic and social effects in a five-star hotel than in a rural village. Additionally, a consequence in a distribution network is not static because the topology and the operating mode can change.

For a discrete time series, any set of disjoint regions $\beta = \{C_i\}_{i=1}^{m}$ that covers the state space $S$ is called a partition [20]; that is,

$$\bigcup_{i=1}^{m} C_i = S, \qquad C_i \cap C_j = \emptyset \quad \text{for } i \neq j.$$

If a unique symbol from Ω, where Ω is a symbol set defined as {S0, S1, S2, ..., Sm-1}, is assigned to each region of the partition, then the time series data can be represented as a symbol sequence, where i is the starting index of the sequence in the symbol set Ω and m is the length of the symbol sequence. Similar to fuzzy sets, this symbolic representation can abstract the information, but it permits more flexibility and uncertainty than fuzzy sets. It is assumed that the dynamical system is stationary on the fast time scale and that any nonstationarity is observable only on the slow time scale. In symbolic dynamics, the slow time scale is typically defined as being at least two orders of magnitude larger than the fast time scale. For convenience, we define five levels to describe the risk in the distribution network (very high, high, medium, low and very low), which can be represented by the symbol set Ω = {A, B, C, ..., O}.
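The partition-and-symbol idea can be illustrated with a minimal Python sketch. It assumes a uniform-width partition of a known value range and assumes that lower values map to earlier symbols; neither choice is stated in the text, and the function name `symbolize` and its parameters are hypothetical.

```python
import string


def symbolize(series, lo, hi, n_symbols=15):
    """Map each value in `series` to one symbol.

    Assumption (not from the paper): the range [lo, hi] is cut into
    `n_symbols` equal-width regions, and region 0 gets symbol 'A'.
    Values outside the range are clamped to the first or last region.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    alphabet = string.ascii_uppercase[:n_symbols]  # 'A'..'O' when n_symbols=15
    symbols = []
    for x in series:
        # Index of the region that contains x.
        idx = int((x - lo) / (hi - lo) * n_symbols)
        idx = min(max(idx, 0), n_symbols - 1)
        symbols.append(alphabet[idx])
    return "".join(symbols)


if __name__ == "__main__":
    print(symbolize([0.0, 0.5, 1.0], lo=0.0, hi=1.0))
```

Because the regions are disjoint and together cover the range, every value receives exactly one symbol, which is the property the partition definition requires.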

Risk Description Framework
We propose a risk description framework that includes device, structure, load and special operation factors, as Fig. 1 illustrates.
All risk indicators should be calculated independently according to the voltage level. For convenience, it is logical to organize the factors in a layered structure. Theoretically, the more data that are collected, the more accurate the evaluation of risk will be. Because the types of data may vary with location and time, the factor merging algorithm should be robust and flexible. As an example, the organization for the device indicators is given in Fig. 2.
The risk factors may also have sub-factors such as environmental effects, but these will not be discussed here. Because China covers a vast area, it is hard to adopt a uniform risk factor framework. In practice, the selection and categorization of risk indices is first carried out according to national standards. Supplementary indices are then integrated into the framework according to local data collection capabilities and management requirements.

Raw Risk Data Processing
Risk, as defined previously, is a relative value, so a baseline should be chosen for evaluation. For a distribution network risk evaluation, a day with fine weather, a light load, and no defects or malfunctions should be chosen as the baseline. The risk factors can be mapped linearly based on the baseline extreme values. The mapping process is described in the following.

1. Possibility Data Processing. For the device factor layers shown in Fig. 2, we calculate the relative level factor for a line or a substation, where Device Count and Device Level are self-explanatory and Device(l) is the number of devices at a specific level. After the baseline factor is calculated, a mapping from the raw data to a symbolic sequence can be defined, where Ind Max and Ind Min are the maximum and minimum values of the baseline calculations, respectively, and P Idx is the index of the first symbol in the symbol set. We chose three symbols to describe the risk probability and consequence; the other two symbols are the consecutive symbols that follow the one P Idx indicates. The three symbols carry weights W s of 60%, 30% and 10%, respectively. The first and second symbol weights are approximations based on the golden ratio, and the remainder is allocated to the third symbol weight.
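As an illustration of the mapping from a raw indicator to a weighted three-symbol sequence, the following Python sketch may help. The exact mapping formula is not reproduced here, so the sketch assumes a plain linear interpolation between the baseline extremes, rounded to the nearest index and clamped so that three consecutive symbols always fit inside a 15-symbol set. The names `first_symbol_index` and `symbol_sequence` are hypothetical.

```python
SYMBOLS = "ABCDEFGHIJKLMNO"   # 15-symbol set, as in the five-level description
WEIGHTS = (0.6, 0.3, 0.1)     # symbol weights W_s stated in the text


def first_symbol_index(value, ind_min, ind_max):
    """Index of the first symbol (the role P Idx plays in the text).

    Assumption: linear interpolation between the baseline minimum and
    maximum, clamped to the valid range of starting positions.
    """
    last_start = len(SYMBOLS) - len(WEIGHTS)  # last index that leaves room for 3 symbols
    if ind_max == ind_min:
        return 0
    frac = (value - ind_min) / (ind_max - ind_min)
    idx = int(round(frac * last_start))
    return min(max(idx, 0), last_start)


def symbol_sequence(value, ind_min, ind_max):
    """Return three consecutive (symbol, weight) pairs for one indicator value."""
    start = first_symbol_index(value, ind_min, ind_max)
    return [(SYMBOLS[start + k], WEIGHTS[k]) for k in range(len(WEIGHTS))]


if __name__ == "__main__":
    print(symbol_sequence(0.5, ind_min=0.0, ind_max=1.0))
```

The clamping step reflects the requirement that operations cannot exceed the symbol boundaries; how out-of-range values are actually treated in the original mapping is an assumption here.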
2. Consequence Data Processing. The risk factors can have either direct or indirect connections to a malfunction. For factors with a direct connection, such as a device failure, a mapping is defined in which C Idx is analogous to P Idx, MTTR Idn and MTTR Avg represent the affected factor recovery time and the total line or substation recovery time, both expressed in terms of MTTR (Mean Time To Repair), and Level in equation (7) indicates the relative importance of the line or substation. From equation (7), we observe that the line or substation level has a strong influence on the risk consequence.
For indirect connection factors, we convert the raw data in a relative manner. For example, the maintenance department risk consequence Con R for a distribution line can be calculated from MTTR IdnMngAvg and MTTR AvgAll, where MTTR IdnMngAvg is the average MTTR under a specific management staff and MTTR AvgAll is the average MTTR of all of the lines. The average MTTR is calculated from the line MTTR and the line length. For example, MTTR IdnMngAvg is calculated using equation (10), where MTTR Avg includes all of the MTTRs under a specific management staff and Line MngLength is the corresponding line length.
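The relative conversion for an indirect factor can be sketched in Python as follows. Two assumptions are made that the text does not confirm: the average MTTR is taken as a length-weighted mean of the line MTTRs, and the relative consequence is taken as the ratio of the staff average to the overall average. The function names are hypothetical.

```python
def length_weighted_mttr(lines):
    """Average MTTR over `lines`, a list of (mttr_hours, length_km) pairs.

    Assumption: each line's MTTR is weighted by its length.
    """
    total_length = sum(length for _, length in lines)
    if total_length == 0:
        raise ValueError("total line length must be positive")
    return sum(mttr * length for mttr, length in lines) / total_length


def relative_consequence(staff_lines, all_lines):
    """Ratio of one management staff's average MTTR to the overall average.

    Assumption: a value above 1 means slower-than-average recovery.
    """
    return length_weighted_mttr(staff_lines) / length_weighted_mttr(all_lines)


if __name__ == "__main__":
    staff = [(2.0, 10.0), (4.0, 10.0)]      # illustrative values only
    everyone = staff + [(1.0, 20.0)]
    print(relative_consequence(staff, everyone))
```

A relative value of this kind can then be mapped to a symbol sequence in the same way as the directly connected factors.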

Phase-Space Reconstruction
Once the factors in the risk description have been decided, the phase-space dimension and structure are determined. It is possible to recreate the entire trajectory of the system from measurements. Based on equation (6) and the symbol sequence representation, the sequence of state vectors is represented as in equation (11), where {Ind k} is the sequence of the state vectors generated from the raw risk data processing and Δ ∈ N is a time interval in the phase-space trajectory of the system determined by the observation rate. To reflect the layered structure of the risk factors, the factors are grouped according to their place in the risk description framework, such as in Fig. 2.

Symbolic Dynamics Operators
The processing of the raw data and the phase-space reconstruction were discussed in the previous section. The symbolic dynamics operators are presented in this section to establish a foundation for factor merging.

Definition 1: Sequence Index Operator Idx. The sequence index operator returns t, the index of the first symbol in the symbol set Ω for a given symbol sequence. The risk probability and consequence are each represented by a symbol sequence of length 3, L X (3).

Definition 2: Shift Operator. In the operator definitions, [ ] is the Gaussian function and Ws(X1,l) and Ws(X2,l) are the symbol weights described in Section II, Raw Risk Data Processing. The add operator yields the first symbol index of the resulting symbol sequence.

Definition 4: Multiplication Operator. The multiplication operator is defined together with a ratio operator, which is defined for a positive rational number x.

As explained in definition 1, the operations in definitions 2-5 cannot exceed the symbol boundaries.

Results
As mentioned previously, raw risk factor data may vary with location and time. Therefore, it is critical to build a scalable framework. The framework in Fig. 2 is a scalable framework that enables users to add or remove factors as necessary. In this section, the system symbolic description and operators discussed in the previous sections are used to establish an algorithm for distribution network risk evaluation that is scalable.

Risk Factor Correlation
In a layered risk evaluation framework, risk sub-factors contribute to higher-layer risk factors. Because sub-factors may have different effects on the main factor, it is important to measure the relationship between the sub-factors and the main factor. A statistical method is used with the assumption that the more information is available, the more accurate the evaluation of risk will be. Therefore, we calculate the correlations of risk factors, as illustrated in Fig. 3 and as described in the following.
1. Symbol Distribution Calculation. Based on the state vector {Ind k} in equation (11), a symbol j in the symbol set Ω has a distribution probability in which k is the time index of the state vector over which the symbol weights are accumulated whenever symbol j is present, and W s (Max) is the maximum symbol weight, here 60%.
Then, for a specific risk factor, the symbol distribution probability is calculated. For the main risk factor, the overall symbol distribution probability can be calculated, where r is the number of main risk factors and W s (d,l) is the symbol weight of the corresponding dth sub-factor.
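The accumulation of symbol weights into a distribution can be sketched as below. This is a simplified reading, not the paper's formula: it assumes each observation is a list of (symbol, weight) pairs and normalizes the accumulated weights by their grand total so that the probabilities sum to one, whereas the text refers to a normalization involving the maximum symbol weight. The function name `symbol_distribution` is hypothetical.

```python
def symbol_distribution(sequences, alphabet="ABCDEFGHIJKLMNO"):
    """Accumulate symbol weights over time and normalize to probabilities.

    `sequences` holds one entry per time index k; each entry is a list of
    (symbol, weight) pairs. Assumption: normalization by the grand total.
    """
    totals = dict.fromkeys(alphabet, 0.0)
    for sequence in sequences:
        for symbol, weight in sequence:
            totals[symbol] += weight
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("no symbol weight was accumulated")
    return {symbol: weight / grand_total for symbol, weight in totals.items()}


if __name__ == "__main__":
    observations = [
        [("A", 0.6), ("B", 0.3), ("C", 0.1)],
        [("B", 0.6), ("C", 0.3), ("D", 0.1)],
    ]
    dist = symbol_distribution(observations)
    print({s: round(p, 3) for s, p in dist.items() if p > 0})
```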
2. Entropy Calculation. After the symbol distribution probability has been calculated for the risk sub-factors, likeness for the sub-factors can be analyzed, which is useful for grouping them.
We use Kolmogorov-Sinai entropy, which is defined in equation (19), to measure the randomness of the risk factors, and we use the Kullback-Leibler distance in equation (20) to quantify the likeness of the risk factors.
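For orientation, the two quantities can be computed from symbol distributions as in the following Python sketch. It assumes the entropy takes the familiar Shannon form over the symbol probabilities and uses natural logarithms; the exact forms of equations (19) and (20) may differ. The small floor `eps` on the reference distribution is an addition of this sketch to avoid division by zero.

```python
import math


def entropy(p):
    """Shannon-form entropy of a probability list (assumed form, natural log)."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_distance(p, q, eps=1e-12):
    """Kullback-Leibler distance D(p || q) between two probability lists.

    Terms with p_i = 0 contribute nothing; q_i is floored at `eps`
    (an assumption of this sketch) so the logarithm stays finite.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    uniform = [0.25, 0.25, 0.25, 0.25]
    skewed = [0.7, 0.1, 0.1, 0.1]
    print(entropy(uniform), entropy(skewed))
    print(kl_distance(skewed, uniform))
```

A uniform distribution gives the largest entropy for a given number of symbols, and the Kullback-Leibler distance is zero only when the two distributions coincide. Note that the distance is not symmetric in its arguments.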
3. Factor Grouping. It should be noted that the randomness of the symbol distributions affects the accuracy of the Kullback-Leibler distance. Therefore, we define a refined measurement. This measurement was chosen such that, even if the Kullback-Leibler distance is small, a high degree of randomness in the Kolmogorov-Sinai entropy reduces the possibility of two risk factors belonging to the same group, and vice versa. The group threshold is set at 2 to allow the largest possible grouping of similar sub-factors. For a specific main risk factor, its n sub-factors are grouped into f categories of risk subsets.
The goal is to correlate sub-factors to main factors through the symbol distribution probabilities, but if certain types of data are more abundant than others, the information in the less abundant data may be obscured. Grouping data into categories not only reduces the dimension of the space, which further simplifies the process, but can also reveal information that would otherwise be lost.
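The grouping step can be pictured with a small Python sketch. Since the refined measurement itself is not reproduced here, the sketch takes it as a caller-supplied function `measure(a, b)` and assumes that two sub-factors may share a group when the measurement between them is below the threshold. The greedy first-fit strategy and the name `group_factors` are choices of this sketch, not of the paper.

```python
def group_factors(names, measure, threshold=2.0):
    """Greedy first-fit grouping of sub-factors.

    A sub-factor joins the first existing group in which its measurement
    to every member is below `threshold`; otherwise it starts a new group.
    Assumption: smaller measurement means more alike.
    """
    groups = []
    for name in names:
        for group in groups:
            if all(measure(name, member) < threshold for member in group):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


if __name__ == "__main__":
    # Toy one-dimensional positions standing in for real distributions.
    position = {"defect": 0.0, "age": 1.0, "user_level": 5.0}

    def toy_measure(a, b):
        return abs(position[a] - position[b])

    print(group_factors(list(position), toy_measure))
```

The result of a greedy pass depends on the order in which sub-factors are visited, so a production implementation would likely need a deterministic ordering or a proper clustering method.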
4. Quantification of Correlations. This step attempts to relate the f categories of risk subsets to the main risk factor. We will describe the process using an example.
Without loss of generality, assume that a category f1 ∈ f has m1 sub-factors. We could then construct the time series of f1 as in equation (11). In this manner, the state vector phase-space is reduced from m to f. From equations (17)-(21), we can calculate the distance between the f categories and the distribution of all the indicators. This is the quantitative distance between the sub-factor set and the main risk factor. The quantitative correlation coefficient is then defined from this distance and the size of the sub-factor set: both the Kullback-Leibler distance and the number of factors in a sub-factor set contribute to the correlation coefficient, and the Kullback-Leibler distance has more influence.

Risk Factor Merging
Merging of risk factors simplifies the calculation of main risk factors from sub-factors. Merging is defined over the sub-factor groups, where n is the number of sub-factor groups.

Event Risk Calculation
Prior to this step, the risk is calculated from the failure probability and the consequence independently. For a specific component, line or substation, we can calculate the overall risk as the product of the failure probability and the consequence. However, this definition, which was derived from equation (1), is mainly a statistical result. The variable nature of risk is not included. Therefore, an improvement is desired.
Using the shift operator in definition 2 on the phase-space in equation (11), we can obtain another time series vector for some value of g. Referring to the factor grouping and factor merging methods, we can define a fluctuation parameter. To merge the four major types of risk factors, we define the merge operation given by equation (28), where Device Ind, Struct Ind, Tech Ind and Load Ind are the device, structure, special operation and load factors, respectively, and Ind Society and Ind Weather are social and weather effect parameters, respectively, selected in accordance with the norms of that locality. The defining equation for the final overall risk shows that greater diversity in the sub-factors results in a greater risk of the event.

Algorithm Discussion
Risk is a relative concept based on probability theory. In distribution networks, if remedial measures and schedule plans were included in the risk evaluation, the failures and the losses would all have probabilistic characteristics. The proposed method is built on symbolic dynamics, and the result is intuitive, which is helpful in management. The following discussion further explains the concepts and the implementation of the method.

1. Information Abstraction. A distribution network is a complex dynamic system that involves many types and large volumes of data. Therefore, information abstraction is very important.
Because risk is a relative concept, a linear demarcation of baseline data for basic risk standards as described in equations (5) and (6) is feasible, but this approach is not accurate. Furthermore, because the probabilistic nature of risk leads to vagueness in its evaluation, symbolic dynamics are used to incorporate language vagueness in the risk description.
Compared with fuzzy set abstraction, which expresses a variable using a definite category, symbolic dynamics use a symbol sequence to describe a variable, which enables further information processing.
Because all raw data are mapped into the symbol set, further uniform processing can be achieved. This relative processing technique conforms to the risk concept. The method of merging risk factors using symbolic dynamics operators offers a new way to compute risk factor relationships.
2. Probability-based Analysis. Risk is a probability-based concept. For layered risk factors, sub-factors can affect risk factors in higher layers. Under these assumptions, the symbol distributions are calculated to reflect the failure and consequence probabilities. The Kullback-Leibler distance is used to measure the relationships between the sub-factors and the main risk factors and can be used as coefficients to adjust for variable randomness.
3. Scalability. The data may vary with location or time, and they may have different numbers of sources. Thus, the risk framework should scale according to the data sources and should not allow data from more sources that is greater in volume to mask information conveyed by data from fewer sources.
Equation (22) provides the mean value to describe a risk factor category, and equation (25) merges risk factor categories. Regardless of the number of sub-factors in a risk category or the number of categories, this method provides a uniform risk value. Therefore, the risk can be calculated for the same description framework regardless of the number of data sources, which permits scalability.
4. Complexity Analysis. Table 1 gives the approximate computational complexity estimates for the various steps in the algorithm, assuming the basic dimension of description state vector is m, as in equation (11).
5. Algorithm Acceleration. The most time-consuming processes are the symbol distribution, the factor grouping and the factor merging. Because these operations are based on historical data, they can be performed at system initialization and then updated periodically. In this manner, each step can be reduced to O(m) complexity, which is very desirable.
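One way to realize the "initialize once, update periodically" idea is to keep running totals of the symbol weights, so that each new observation costs only as much as its own length. The following Python sketch illustrates this under the assumption that observations arrive as lists of (symbol, weight) pairs and that probabilities are the accumulated weights divided by their grand total; the class name `RunningSymbolDistribution` is hypothetical.

```python
class RunningSymbolDistribution:
    """Cache of accumulated symbol weights that supports cheap updates."""

    def __init__(self, alphabet="ABCDEFGHIJKLMNO"):
        self.totals = dict.fromkeys(alphabet, 0.0)
        self.grand_total = 0.0

    def update(self, sequence):
        """Fold one new observation, a list of (symbol, weight) pairs, into the cache."""
        for symbol, weight in sequence:
            self.totals[symbol] += weight
            self.grand_total += weight

    def probabilities(self):
        """Current symbol probabilities; no pass over historical data is needed."""
        if self.grand_total == 0:
            raise ValueError("no observations have been added")
        return {s: w / self.grand_total for s, w in self.totals.items()}


if __name__ == "__main__":
    cache = RunningSymbolDistribution()
    cache.update([("A", 0.6), ("B", 0.3), ("C", 0.1)])   # historical data at start-up
    cache.update([("B", 0.6), ("C", 0.3), ("D", 0.1)])   # periodic update
    print(cache.probabilities()["B"])
```

The grouping and merging results could be cached in a similar way and recomputed only when the distributions have drifted enough to matter.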
6. Multiple Granularity Management. Because the risk factors are layered, grouped and calculated, the risk failure and consequence distributions can be calculated and the correlations between risk factors can be tracked. Therefore, management of risk factors with differing granularities can be implemented. From the risk failure and consequence distributions, proper countermeasures may be taken. From the correlations between risk factors, countermeasures can be prioritized.

Discussion
The Sanya Power Company supplies power to Sanya, a popular tourism site in China. The company controls three 220kV substations, twelve 110kV substations, seven 220kV lines, twenty-four 110kV lines and one hundred and eighty-one 10kV lines. Sanya is a tropical island, so its distribution network is prone to disruptions from weather and other environmental factors. Therefore, risk management is very important for improving reliability. Fig. 4 shows the high-voltage distribution network in Sanya. Certain substations or lines that are not under the SPC's administration are included to simplify the calculations.
The risk was given five levels, as shown in Table 2. In our research, the following data were collected:

Various loads, weather conditions and community activities may affect the overall risk, as equation (28) indicates. However, risk is a relative value. In our evaluation for 2013, the baseline was set at the minimum overall load day in 2009. We list the analysis results for three lines in this section, namely Yali II, Yali I and Yatian, which are high-risk transmission lines. The results for the substations are omitted for brevity. In this example, the maximum overall load day and a typical rainy day in 2013 were selected. The baseline, the maximum overall load day and the rainy day in 2013 were evaluated using the parameters given in Table 3.
The sub-factor correlation coefficients obtained from equations (22) to (24) are listed in Table 4.
From Table 4, we conclude the following:
1. The user level and the maintenance are highly correlated with the overall risk.
2. Because Yali II is a relatively new line, defect management for that line has less of an effect than it does with the older lines, Yali I and Yatian.
The structure indicator varied insignificantly in our three evaluation examples. Fig. 5 and Fig. 6 give the symbol distributions for a 110kV line structure failure and consequence, respectively. Table 5 lists the corresponding symbol distribution probabilities.
The overall device risk indicators calculated from equations (25) and (26) are listed in Table 6.
As can be observed from Table 6, the 110kV line is relatively reliable. A comparison with a high-risk 10kV line is given in Table 7.
The overall risk values are listed in Table 8. To compare high-voltage risk characteristics, we give the risk analysis results for several 10kV lines in Table 9.
Although the algorithm in this paper places few constraints on data availability, a lack of data would greatly affect the soundness of the result, because both the symbolic abstraction of the data and the entropy-based correlation evaluation depend on the data for their precision. The more data that are available, the more precise the result is.
Although the categorization also influences the final result, data categorization can be carried out under national or provincial system monitoring and maintenance guidance, which would lead to uniform categorization over a relatively large area.

Conclusions
Distribution networks are exposed, and their operation can be disrupted for many reasons. Because the topology and the operating mode of a distribution network are dynamic, failures and their consequences are probabilistic in nature. This study investigated a risk evaluation method based on symbolic dynamics. Because of the relative nature of risk, symbolic dynamics is used to abstract the information contained in raw data. To accommodate a layered framework for risk factors, symbolic dynamics operators were discussed. To analyze the relationships between risk factors in a layered structure, quantitative correlation values were obtained using the Kullback-Leibler distance and Kolmogorov-Sinai entropy in the symbol distribution analysis. A method for merging risk factors using the symbolic dynamics operators that enables the management of risks with multiple granularities was discussed. Finally, the method was demonstrated using an example from the Sanya distribution network.