Local Difference Measures between Complex Networks for Dynamical System Model Evaluation

A faithful modeling of real-world dynamical systems necessitates model evaluation. A recent promising methodological approach to this problem has been based on complex networks, which in turn have proven useful for the characterization of dynamical systems. In this context, we introduce three local network difference measures and demonstrate their capabilities in the field of climate modeling, where these measures facilitate a spatially explicit model evaluation. Building on a recent study by Feldhoff et al. [1] we comparatively analyze statistical and dynamical regional climate simulations of the South American monsoon system. Three types of climate networks representing different aspects of rainfall dynamics are constructed from the modeled precipitation space-time series. Specifically, we define simple graphs based on positive as well as negative rank correlations between rainfall anomaly time series at different locations, and such based on spatial synchronizations of extreme rain events. An evaluation against respective networks built from daily satellite data provided by the Tropical Rainfall Measuring Mission 3B42 V7 reveals far greater differences in model performance between network types for a fixed but arbitrary climate model than between climate models for a fixed but arbitrary network type. We identify two sources of uncertainty in this respect. Firstly, climate variability limits fidelity, particularly in the case of the extreme event network; and secondly, larger geographical link lengths render link misplacements more likely, most notably in the case of the anticorrelation network; both contributions are quantified using suitable ensembles of surrogate networks. Our model evaluation approach is applicable to any multidimensional dynamical system and especially our simple graph difference measures are highly versatile as the graphs to be compared may be constructed in whatever way required. Generalizations to directed as well as edge- and node-weighted graphs are discussed.


Introduction
The study of complex system dynamics by means of complex network theory has thrived in recent years [1,2]. Applications cover many branches of science as virtually any multidimensional dynamical system may be considered as a network of dynamically interacting components. For example, climate or brain functional networks are thought to consist of nodes representing climate dynamics or neural activity in different geographical or brain regions and links between them characterizing interregional covariabilities [3][4][5][6][7].
Recently, Feldhoff et al. [8] have proposed to use complex networks for climate model evaluation comparing functional networks built from simulated and observed climate data. In contrast to most commonplace methods, this approach offers a multivariate evaluation perspective and the authors have demonstrated its complementarity to traditional univariate methods based on mean values and variances. Beyond climate science, the fundamental idea of a model evaluation based on functional networks applies to numerous kinds of real-world dynamical systems. For instance, it could be used to evaluate models of social dynamics [9], financial markets [10], neural activity [11,12] or genetic regulatory systems [13,14].
A quantification of modeling accuracy following the new paradigm requires measures of difference between complex networks with known node correspondence. Those put forward thus far usually compare community [15,16] or neighborhood [8,[17][18][19] structures and boil down network dissimilarities to one single number. While this is advantageous from a model rating point of view it is of limited use when an evaluation aims at model improvement. In that latter context a more detailed account of network differences is required in order to identify those system components whose interaction with the rest of the system is simulated most deficiently. This motivates the work on local network difference measures presented in this article.
There already have been methods proposed in the literature to quantify the structural similarity of network nodes. The older and simpler concept of structural equivalence considers nodes similar if they have many common network neighbors [20,21] while the more recent concept of regular equivalence considers nodes similar if the nodes they are connected to are themselves similar [22,23]. The latter concept requires transitivity of similarity through network links and therefore does not apply to networks where links represent dissimilarity. Since such networks can occur in the model evaluation context as exemplified in the application section we define local network difference measures in accordance with the concept of structural equivalence. To our knowledge, we are the first to employ this concept to quantify the structural (dis)similarity of nodes from different networks.
To illustrate the general scope of application of the new evaluation approach, let us consider an N-dimensional dynamical system of which we know the time evolution within a certain time frame, i.e. we know x i (t) for i = 1, . . ., N and t = 1, . . ., M. A description of the system dynamics according to the complex network approach is then based on statistical relationships between the time series as quantified by some measure of statistical dependence. Measures that have been used for this purpose include linear correlations [24,25] as well as nonlinear measures of synchronization [26,27] and mutual information [4,28]. The dependence measure of choice is calculated for every pair of time series, which yields an N × N matrix of relationship coefficients r ij .
Translated into network language, the N different components or dimensions of the system are the nodes or vertices of a network and the coefficients r ij constitute links or edges of varying strength between them. This identification enables the application of network structure analysis tools, aiming at an improved understanding of the collective dynamics of the system under study. To simplify the analysis, the matrix of relationship coefficients is frequently mapped to a binary adjacency matrix with components a ij = 1 if nodes i and j are connected and a ij = 0 otherwise [29,30]. These unweighted links are typically assigned according to the significance of the respective statistical relationship. Now let us consider the system to be a real-world one that is to be modeled, i.e. let us assume we have x A i ðtÞ from measurements and x B i ðtÞ from model simulations and that we want to assess how close the simulated dynamics are to the observed. Then the complex network approach allows for an evaluation of the interdimensional covariabilities of the system by comparing ðr A ij Þ with ðr B ij Þ or ða A ij Þ with ða B ij Þ. In the following section we define local difference measures between complex networks represented by matrices of the (r ij ) and (a ij ) type. We then demonstrate the capabilities of these measures in a climate model evaluation case study. Potential generalizations of our theory and applications beyond climate model evaluation are discussed at the end.

Local network difference measures
In this section, after briefly recalling some basic graph theoretical notions, we first define a local version of the Hamming distance [29,31] between binary adjacency matrices, describe a practical problem coming along with its application, and present a solution, which culminates in the definition of a new local network difference measure. Subsequently, we discuss the cases of edge-weighted and node-weighted networks. For the sake of simplicity, only undirected networks are considered throughout this section; a generalization to directed graphs is mentioned in the discussion section.
We use the superscripts A and B to discriminate between the networks to be compared. According to the model evaluation context we assume that these networks share a common set of N nodes, and that they have been constructed from (A) observed and (B) simulated data following the same recipe, as outlined in the introduction and exemplified in the application section. Because the diagonal elements of the (r ij ) and (a ij ) matrices encode statistical relationships between identical time series, they are disregarded by all difference measures defined hereafter.

Simple graphs
Based on the above, let us consider two networks represented by symmetric binary adjacency matrices ða A ij Þ and ða B ij Þ with diagonals set to zero. Such networks are known as simple graphs in the mathematical literature. Nodes which are connected are called neighbors. The set of all nodes connected to node i is i's neighborhood. The number of its neighbors is i's degree and can be written as The average degree over all nodes, divided by N − 1, is called the link density ρ of the network. The global Hamming distance (GHD) between two simple graphs A and B has been defined to be [29,31] HðA; BÞ ¼ We define its local version by relating the neighborhoods of node i in both networks. The local Hamming distance (LHD) H i (A, B) counts the number of nodes, which are either a neighbor of i in network A but not in B or vice versa. Global and local Hamming distances are related by H(A, B) = ∑ i H i (A, B). Differently from [29], we refrain from normalizing the LHD to the number of nodes in this section, because it would make the following mathematics less convenient. Fig. 9(j) shows LHDs between climate networks over South America with degree fields displayed in Figs. 4(i) and 4(j). (We are going to elaborate on the background of these networks in the application section; for the moment the reader is asked to take them as given.) We observe large LHDs in locations with large degree discrepancies, for example around 10°S, 50°W. Moreover, larger LHDs tend to occur in locations with larger degrees. For instance, small degrees coincide with small LHDs along the western coast of South America.
These observations indicate that the LHD is not an ideal local network difference measure.
Equal LHD values at these nodes would mean different relative agreements of neighborhoods. It is therefore difficult to interpret LHD values without considering the degrees. We would like to have a more intuitive difference measure which quantifies the dissimilarity of neighborhoods relative to their size. An equivalent problem has been encountered by those studying vertex similarity concepts [21]. While, to our knowledge, this community has always made do with ad hoc renormalizations [32][33][34], we take a different approach in the following.
Our point of departure is a statistical null model which explains the observed LHD-degree dependence: Let i be some fixed node with degrees k A i in graph A and k B i in graph B, and let us assume i's neighborhoods in A and B to be statistically unrelated. This can be modeled considering a A ij and a B ij (j 6 ¼ i) to be random binary variables that are statistically independent between networks and equal to one with identical probability within networks. Using Eq. (3), this null model yields an LHD expectation value of hH i ðA; Beyond illustrating the relationship between LHD and degree, this null model can be used to define a new local network difference measure, which is degree-independent. To that end, we relate the actually measured LHD value H i (A, B) to the null model probability distribution of possible LHD values for the degrees k A i and k B i . More specifically, H i (A, B) is mapped to its null model p-value, i.e. to the probability of the LHD to take a value less than or equal to H i (A, B), if i's neighborhoods of size k A i in A and k B i in B were statistically unrelated. We now derive a formula for those p-values, i.e. we derive the cumulative distribution function (CDF) of LHDs generated by our null model. According to Eq. (3) we can write The sum in Eq. (5) counts the number of common neighbors of i in A and B. We denote this number by N i (A, B) and have For fixed k A i and k B i , this implies a one-to-one correspondence between N i (A, B) and H i (A, B), with changes of the former by +1 corresponding to changes of the latter by −2. Let us denote the null model generated LHDs by h ¼ k A i þ k B i À 2n in the spirit of Eq. (6). The null model probability of node i to have exactly n common neighbors in the two networks is equal to the probability of having exactly n successes in k B i draws without replacement from a population of size N − 1 containing k A i successes. This probability is given by the probability density function (PDF) of a hypergeometric distribution [35,36]. Note that P n; k A i ; k B i ; N À Á is symmetric with respect to exchanging k A i with k B i . The desired p-value formula results from Eq. (7) via In order to numerically evaluate this complementary CDF of a hypergeometric distribution, we use the HyperQuick algorithm [37]. This algorithm has the advantage of being easily logarithmized, which is important because of the following observation. The p-values of LHDs between climate networks on observational and modeled rainfall data studied in the application section turn out to typically be many orders of magnitude smaller than one. In other words, the climate models studied here are much better than our null model. Since the latter corresponds to the Configuration Random Network Model [1], this confirms the GHD findings by Feldhoff et al. [8]. To still be able to visualize p-values, we move to their logarithms. The agreement of our analytical result for p H i ðA; BÞ; k A i ; k B i ; N À Á and its Monte Carlo simulation is depicted in Fig. 1. Unfortunately, log p is degree-dependent, again. For instance, for its minimum value we find that, per simultaneous increment of both degrees, min h log p decreases by approximately −log ρ for small link densities ρ. To overcome this problem, we rescale log p by its minimum value, thus defining a new local network difference measure which we call the degree-conditional neighborhood dissimilarity (DND). Note that D i (A, B) is only defined if k A i and k B i are both positive. The definition is independent of the base of the logarithm.
The DND can only take values in [0, 1], with a value of zero (one) meaning maximally (minimally) overlapping neighborhoods of node i in the two networks, given the degrees. Note that D i (A, B) = 0 does not imply a local agreement between networks. Rather, it means the greatest possible agreement for given degree differences. The DND should therefore always be considered together with either the LHD or the degrees of the compared networks.
In the case of a comparison of spatially embedded networks, the LHD shows embedding artifacts such as boundary effects [38], which are inherited from the degree. Since the transformation of LHDs to DNDs frees the former of its degree dependence, the latter does not show the according artifacts. As an example, compare the LHDs between regional climate networks in Fig. 9(i-l) with their respective DNDs in Fig. 9(m-p).

Edge-weighted graphs
As outlined in the introduction, simple graphs describing complex system dynamics usually result from a binarization of a matrix of statistical relationship coefficients r ij . Such a procedure artificially degrades the network information content. Moreover, it usually involves the introduction of binarization parameters, which many properties of the resulting simple graphs depend on.
In the model evaluation context it may be desirable to omit this problematic procedure, i.e. to directly evaluate the simulation accuracy with respect to the (r ij ) matrix. To that end a local network difference measure of such matrices is needed. We will now introduce one for r ij being Pearson correlation coefficients.
In formal analogy to the (normalized) local Hamming distance, we define the local correlation distance (LCD) by F: ðÀ1; 1Þ ! ðÀ1; 1Þ; r7 !arctanh r: The use of the Fisher transformation [39] F in the LCD definition is motivated as follows. The confidence interval width of Pearson correlation estimates depends on their value. An r ij value around 0 usually has a wider confidence interval than a value close to ±1. Therefore, at correlation coefficient values around 0, we expect j r A ij À r B ij j to be greater than at values close to ±1.
The use of Fisher transformed coefficients balances this disparity. For normally distributed time series, the standard estimation error of F(r ij ) is approximately independent of r ij [39][40][41][42]. Thus, in the definition of Eq. (11), differences of r ij values close to ±1 contribute as much to C i (A, B) as those at values around 0. Furthermore, using the non-transformed correlation coefficients in Eq. (11) would make C i (A, B) dependent on P j6 ¼i r X ij , X = A, B, analogously to the LHD-degree dependence on simple graphs. Employing the Fisher transformation prevents this dependence and renders a DND analog for edge-weighted graphs unnecessary.
For later reference, we define the global correlation distance (GCD) by Basically, the ansatz just outlined for the Pearson correlation may be applied to any dependence measure. However, as we have tried to argue, prior to any distance calculation, dependence coefficients should be transformed such that their uncertainties become valueindependent. This should be seen as an incentive to develop currently lacking estimation error theories for statistical dependence measures.

Node-weighted graphs
To conclude this section, we shortly discuss a generalization of our approach to networks featuring node weights w i , which we assume to be equal in networks A and B. For instance, nodes of a climate network may represent differently sized portions of the earth's surface.
Given such networks, it is desirable to define network measures, that take node weights into account. Following a general solution to this problem proposed in [43], we obtain the nodeweighted version H Ã i of the LHD, Averaging H Ã ðA; BÞ ¼ P i w i H Ã i ðA; BÞ yields the node-weighted version of the GHD [8]. Analogously, we can define a node-weighted LCD by Deriving a node-weighted version of the DND is less straightforward. In formal analogy to Eq. (10), D Ã i ðA; BÞ should be a function of H Ã i ðA; BÞ and the node-weighted degrees In contrast to k i though, any rearrangement of i's neighborhood potentially changes k Ã i , depending on the spectrum of node weights. The combinatoric null model approach that led from LHD to DND is thus unsuitable for a derivation of D Ã i ðA; BÞ. We could numerically compute the node-weighted null model CDF using Monte Carlo techniques, but a general closedform analytical solution appears to be out of reach.
As the node-weighted DND should equal the DND in the case of constant node weights, we can nevertheless provide an approximation to the unknown D Ã i ðA; BÞ based on an approximation to D i (A, B): In Fig. 1 we observe that the function log p h; k A i ; k B i ; N À Á may be approximated by a straight line through its extreme points, preserving its strict monotonicity and range of values. Substituting log p in Eq. (10) accordingly yields with minimum and maximum LHD values for fixed degrees of minh ¼j Since the hypergeometric distribution is log-concave [44], log p h; k A i ; k B i ; N À Á is a concave function in h [45]. Consequently, the approximate DND values are always less than or equal to their true counterparts. Expression (18) shows that the DND is approximately equivalent to the structural vertex similarity measure introduced in [34].
For fixed node-weighted degrees, the minimum and maximum node-weighted LHDs are Hence, an approximation to the node-weighted DND is given by Application to climate model evaluation As alluded to in the introduction, Feldhoff et al. [8] have demonstrated the complex networks approach to climate model evaluation, comparing the performance of two regional climate models (RCMs) over South America. Specifically, the authors constructed networks of 2m temperature, precipitation, 500hPa geopotential height and sea level pressure time series from hindcast climate simulations and ERA-Interim reanalysis data [46]. The GHD between model and reanalysis networks was employed as the error measure, with respect to which the statistical model STARS outperformed the dynamical model CCLM for all variables except the 500hPa geopotential height. The ERA-Interim reanalysis data are the result of a 12-hourly assimilation of historical meteorological observations and their integration by a numerical weather prediction (NWP) model. The data quality varies from variable to variable and depends on the relative influences of the observations and the NWP model [47][48][49]. Among the variables considered by Feldhoff et al. [8], precipitation is the most uncertain one as it is difficult to model and since no precipitation measurements are included in the assimilation of ERA-Interim.
Therefore, in the following, we reevaluate the quality of networks of precipitation time series modeled by CCLM and STARS, this time with respect to networks of observed precipitation data, and using our new local network difference measures. Before discussing results, we briefly introduce the models and the observational dataset, specify the precipitation climatology to be modeled, and describe our network construction methodology.

Models
The RCMs CCLM [50,51] and STARS [52,53] differ fundamentally in their approach to climate modeling. While the statistical model STARS takes climate data as input and stochastically resamples them such that a prescribed trend of some variable is matched, the dynamical model CCLM is the climate version of a NWP model, i.e. it numerically solves the differential equations governing the dynamics of the atmosphere under prescribed external drivers such as the solar radiative forcing and greenhouse gas concentrations.
For detailed descriptions of both models and the experimental setup of the evaluation see [8] and references therein. In short, ERA-Interim reanalysis data was used to force both models. With STARS, daily mean value time series from 1979 through 1995 were resampled in order to reproduce the 1996-2011 temperature trends. CCLM was run on the CORDEX South America domain [54,55], simulating 1979 through 2011 to ensure a proper model spin-up during the first half of the simulation period. The daily mean precipitation time series were then conservatively interpolated [56] back to the ERA-Interim grid in order to enable network constructions on a common node set. We also include the ERA-Interim reanalysis data (ERAI) in our reevaluation.
While there is only one CCLM run and one reanalysis dataset, with STARS an ensemble of 200 climate realizations was generated. All quantities evaluated in the following are calculated for each realization individually and then averaged across the ensemble, i.e. any STARS result shown represents the respective ensemble average.

Precipitation data and climatology
Arguably the best observational precipitation dataset available for tropical South America is the Tropical Rainfall Measuring Mission (TRMM) 3B42 V7 daily satellite product, which starts in 1998 and comes at a native resolution of 0.25° [57]. The product is based on measurements by radar, infrared and microwave sensors aboard numerous satellites, and calibrated by station data. While it is available up to 50°latitude, the TRMM precipitation radar data only extend to 36°latitude so that we constrain our analysis to latitudes north of 40°S, where the data are most reliable. To facilitate a proper network comparison, the TRMM data are conservatively remapped to the native ERA-Interim grid which has a resolution of about 0.7°.
We concentrate our analysis on the austral summer season DJF, when a monsoon system develops over tropical South America and provides for most of the annual precipitation [58][59][60]. The region under study tightly encompasses the South American mainland and is depicted in Fig. 2, where we show the DJF mean values and 90th percentiles of daily precipitation amounts as measured by TRMM and modeled by CCLM, ERAI and STARS from 1998 through 2011. The TRMM data exhibit the typical pattern of abundant rainfall in the Intertropical and the South Atlantic Convergence Zone (ITCZ and SACZ, respectively), and along the eastern slopes of the Andes.
These main rainfall patterns are replicated by the models but we find substantial differences in intensities. Regarding the seasonal mean values, CCLM mostly underestimates rainfall, while ERAI and STARS are closer to TRMM except along the Andes, where we find strong overestimations. The 90th percentiles, which quantify the intensity of extreme rain events, are mostly and substantially underestimated by all models. While TRMM shows values greater than twice the respective mean values throughout the study region, the models simulate a smaller ratio at most locations. Such intensity underestimations of extreme rainfall events are shared by many climate models [61]. M = 1170 days. Our domain/network comprises N = 5460 grid cells/nodes. We focus on networks without node weights to enable an application of the DND. This is reasonable since LHD and GHD results do not change qualitatively when node weights representing grid cell sizes are introduced, which is because grid cells do not vary much in size within the domain. To capture different aspects of the precipitation dynamics, we apply two statistical dependence measures-the Event Synchronization (ES) [62] and the Spearman Rank correlation coefficient (SR) [63].
The ES may be used to analyze the spatial synchronicity structure of extreme precipitation events [27] which has been done over South America based on the TRMM data used in this study [64]. We adopt the network construction methodology described in the latter study. It is based on a transformation of precipitation time series to binary extreme event time series as depicted in Fig. 3. At each location, daily precipitation above the 90th percentile of its climatological (DJF 1998-2011) distribution is defined as an extreme event. Grid cells at which the 90th percentile is zero in any of the observed or simulated datasets [hatched in Fig. 2(e-h)] are discarded from the analysis. Between two extreme event time series at different locations, the ES then quantifies the degree of event synchronization, with two events contributing to ES if they can be uniquely associated within a maximum period of 3days. Since no variance-stabilizing transformation is known for ES, we confine the ES network evaluation to the respective simple graph whose links represent the ρ = 2% highest ES values and which we call ESp.
Besides focusing on extremes, we also aim at evaluating the general spatiotemporal precipitation dynamics. To that end we employ the SR approach proposed by Feldhoff et al. [8], which requires preprocessing of the original precipitation values to anomalies with respect to the 1998-2011 climatology (Fig. 3). We first apply a 3-day moving average filter to the daily precipitation time series so as to make them represent synoptic weather situations. In order to prevent trivial correlations due to seasonality, these smoothed time series are then rendered approximately stationary in mean and variance. We achieve this by rescaling the time series of each individual DJF season to their respective climatological (1998-2011) average smoothed by a 7-day Gaussian filter (blue curve in Fig. 3). We prefer this division by climatological daily values over their more conventional subtraction, because only the scaling guarantees a common rank of all zero precipitation events. To avoid division by zero we define a minimum climatological value of 0.1mm/day, which customarily is referred to as the smallest measurable daily precipitation amount. Statistical dependences between the resulting anomaly time series are then quantified by the SR at lag zero; ties are handled according to the mid-rank method [65]. From the matrix of rank correlation coefficients we derive three networks. The matrix as it stands defines an edge-weighted network, which we denote by SR. Simple graphs representing the 2% most positive and the 2% most negative SR values are defined to disentangle these two different kinds of correlation, and are denoted by SRp and SRn, respectively.
Altogether and schematically, our networks are constructed as follows: where x i (t) are the local daily precipitation time series, P represents their preprocessing to anomaly or extreme event time series y i (t) (cf. Fig. 3), S = SR, ES marks the application of a statistical dependence measure to all pairs of time series which results in a matrix (r ij ) of correlation or synchronization strengths, and T denotes the thresholding that yields the simple graph adjacency matrices (a ij ) according to a ij ¼ Y½tðrÞ À r ij for SRn; Y½r ij À tðrÞ À d ij for SRp; ESp; with the Heaviside function Y, the Kronecker delta δ ij , and the threshold τ set such as to obtain the desired link density ρ.
It should be noted that the modeling accuracy of these networks is, in principle, independent of general rain amount biases. Two pairs of time series can have the same correlations even if standard deviations or means of the time series differ. Similarly, two pairs of (extreme) event time series can agree in synchronicity despite disagreement in the event defining threshold. Therefore, the discrepancies between observed and modeled precipitation mean values and 90th percentiles we see in Fig. 2 do not preclude agreement in the SR or ES network structures.

Network topologies
Prior to an application of our novel network difference measures we investigate the individual network topologies starting with mean Fisher transformed rank correlations, and simple graph degrees as shown in Fig. 4. The f i field is mostly positive which shows that positive correlations between anomaly time series predominate. The few locations of negative f i differ in position between datasets. Spatial contrasts between f i values are least pronounced for TRMM and most for CCLM. Although SRn and SRp links encode fundamentally different statistical relationships, the degree fields of the respective networks roughly agree in exhibiting and locating three distinct regions of enhanced degree indicated by colored boxes in Fig. 4(i). All models reproduce this general pattern. As to the climatological interpretation of the SRp network we notice that the regions of large degree are adjacent to the zones of most abundant seasonal rainfall (cf. Fig. 2), while over those zones themselves (Amazon basin, ITCZ, SACZ) degrees are low, just like over the very dry southern hemisphere Pacific coast. This is explained as follows. Precipitation time series in very dry regions are mostly constant and can therefore not be correlated to other more rainy, less constant time series. The abundant rainfall in the convergence zones is associated with frequent localized convective rain events [59], hence the low correlation levels here. In contrast, wet and dry periods alternate in the intermediately wet regions adjacent to the convergence zones, which causes large intraregional correlations there. We show below that the SRp network links are indeed purely intraregional. There we also give a climatological interpretation of the SRn network. For a detailed climatological interpretation of the ESp network we refer the reader to [64].
We proceed by analyzing the dependence of mean correlations and simple graph link probabilities on the geographical distance of nodes (Fig. 5). While it is known that correlations between precipitation time series decay with distance [66] and a similar behavior has been found for synchronizations of extreme precipitation events [27], details of the SR and ES decays have not yet been compared and we do not know at which distance to expect the anticorrelations represented by the SRn network. Fig. 5(b) shows an approximately exponential decay of F(r ij ) values with node distance for short distances, followed by predominating anticorrelations for intermediate distances from about 15°to 55°. The distribution of F(r ij ) values becomes bimodal at distances beyond 60°, with positive and negative modes representing correlations and anticorrelations between anomaly time series in the diagonally opposite corners of the domain; deviations of mean F(r ij ) values from zero at those distances are therefore not statistically significant. While all models reproduce these characteristics, STARS and particularly CCLM overestimate the absolute For the three simple graphs, the link probability as a function of geographical node distance is shown in Fig. 5(c); the respective absolute number distribution of node distances is depicted in Fig. 5(a). Most of the 2% strongest anticorrelations turn out to connect anomaly time series 10°to 40°away from each other. SRp and ESp link probabilities decay differently for distances beyond about 10°. In line with F(r ij ) values, SRp link probabilities go to zero around 15°, while some nodes much farther apart are linked in the ESp graph. Distinctly positive general rainfall anomaly correlations are hence confined to be short ranged, while some of the 2% strongest extreme rainfall event synchronizations over South America are found between locations thousands of kilometers apart. Across graph types, CCLM produces too many short-and too few long-range links, which means that the model underestimates the strength of teleconnections relative to local anomaly correlations and event synchronizations. ERAI and STARS show less coherent deviations from TRMM with the reanalysis following the observations most closely, overall.
To shed some more light on the network topologies, Fig. 6 depicts RGB color coded connectivities to the three regions C, M, Y defined by colored boxes in Fig. 4(i). In the TRMM SRn graph, we find nodes south of Y connected to C and nodes in and northeast of C connected to Y. Moreover, we find nodes north of Y connected to M and nodes west of M connected to Y. Rainfall dipole patterns underlying these regional connectivities have been studied in the climatological literature. The M-Y anticorrelation has been associated with active and break  Fig. 4(i) for the TRMM, CCLM, ERAI and STARS (from left to right) SRn, SRp and ESp (from top to bottom) graphs. For each node i, the connectivity to the three regions [marked again by black boxes in (e)] is transformed to an RGB color with additive color mixing. We use cyan, magenta and yellow for links to C, M and Y, respectively, with color intensities proportional to the number of links between i and the respective region. In formulas, if i is connected to n C , n M and n Y nodes in boxes C, M and Y, respectively, we calculate an 24-bit RGB color code of (255 n C /100, 255 n M /100, 255 n Y /100) as all boxes contain exactly 100 nodes, and apply an additional hue shift by 180° [67]. Note that, since the hypothetical case of a node connected to all C, M and Y nodes does not occur, white has a purely decorative meaning. Local Network Differences and Dynamical System Model Evaluation phases in the South American monsoon system [68]. The C-Y one corresponds to the wellstudied SACZ seesaw pattern, which is caused by middle-latitude frontal systems propagating into the tropics [69,70]. In contrast to SRn, the SRp and ESp graphs are dominated by shortrange links and no interregional connections are found. The models reproduce the general patterns of connectivity to the regions C, M, Y for every graph type, yet with reduced accuracy for SRn compared to SRp and ESp. In particular, anticorrelations between C and the maritime SACZ are underestimated by every model (cf. Fig. 4), and CCLM overestimates the strength of the M-Y anticorrelation.

Network differences
We now study deviations of modeled from observed precipitation networks as quantified by the difference measures introduced above. Yet before going local we take a look at global network differences to analyze variations in climate model performances with network type and to connect to the Feldhoff et al. [8] results.
Global network differences. Global correlation and Hamming distances of modeled to TRMM precipitation networks are depicted in Fig. 7. Like Feldhoff et al. [8], we relate climate model performances to those of random network models which preserve certain properties of the respective TRMM network. The GHDs in Fig. 7 have been renormalized by 2ρ(1 − ρ)N(N−1), the expected GHD to an Erdős-Rényi random network with the same link density [8,29,71]. This type of random graph represents the performance of the worst possible climate model, i.e. one that simulates graphs that have nothing in common with the TRMM graph except the link density.
As a second reference we employ the Spatially Embedded Random Network (SERN) model to account for spatial embedding effects on the network topology [38,72]. Graphs generated Local Network Differences and Dynamical System Model Evaluation by this model have the same distribution of geographical link lengths as the respective TRMM graph. We have seen that the climate models basically reproduce those distributions with their network-type dependent character [cf. Fig. 5(c)]. The SERN model performance quantifies what could be expected from a correct such reproduction alone; the smaller the GHD between SERNs and the respective TRMM network, the more topological information on the latter is contained in its link length distribution. We generate an ensemble of 200 SERNs.
When comparing CCLM and ERAI performances with the average STARS performance, one should take into account a disadvantage the statistical model has ab initio. Its resampling algorithm does not preserve the time order of its input data, which implies that, even if STARS was fed with data from the evaluation period, its output would only have the true chronology with negligibly small probability. In contrast, CCLM in its present application and ERAI are aimed at reproducing the actual weather history of the evaluation period. Therefore, while some of the differences between TRMM and STARS networks are due to climate variability represented by the ensemble of climate realizations generated with the statistical model, deviations of this origin are precluded between the TRMM, CCLM and ERAI networks.
To estimate the fraction of network differences that arises from interannual climate variability, we introduce ensembles of networks from resampled TRMM, CCLM and ERAI precipitation time series. We resample according to the bootstrapping method described by Feldhoff et al. [8], i.e. seasonal blocks of data are drawn at random with replacement and concatenated such that the lengths of bootstrapped and original time series are equal. To ensure spatial coherence, every draw is done synchronously at all locations. We construct ensembles of each 200 TRMMb, CCLMb and ERAIb networks from the respective bootstrapped space-time series. This approach is inspired by STARS' resampling algorithm which, in its first step, does just such a bootstrapping with annual blocks of data [8, Fig. 1]. Fig. 7 reveals that performance differences between CCLM and CCLMb or ERAI and ERAIb are minor which shows that CCLM and ERAI global network differences to TRMM mainly reflect model deficiencies. Nevertheless, in the following we will compare STARS network errors to those of CCLMb and ERAIb only and thereby facilitate a fairer model intercomparison.
From the bootstrap network ensemble spread of GCDs and GHDs we can also learn something about the interannual variability present in each dataset. Irrespective of their type, networks from bootstrapped observational data feature spreads about twice as large as those from bootstrapped reanalysis data and even less variability is present in the CCLMb and STARS ensemble data. The SRp graph exhibits the smallest interannual variability of all simple graphs-a persistence that is arguably due to the predominance of short-range links in this network [73].
Coming back to the SERN model, we observe that its performance varies considerably between graph types. These variations are associated with differences between the respective geographical link length distributions [cf. Fig. 5(c)]. In case of the SRp graph, the distribution is highly informative about the network topology. It allows to infer that only short-range links exist in the network and that geographical neighbors are most likely also topological neighbors. Aside from the existence of several long-range links, this also holds true for the ESp graph, hence the superiority of the SERN model over the Erdős-Rényi model for these two graph types. Analogously, the contrastingly poor SERN model performance for the SRn graph is due to the flatness of the respective link length distribution.
Conspicuously, SERN and climate model performances vary quite similarly between graph types. Since SERNs and climate model networks have nothing in common but their link length distribution [cf. Fig. 5(c)], the similarity must have something to do with the latter. Yet the explanation of the performance differences between graph types cannot be the same for the random network model and the climate models since CCLM, ERAI and STARS place links in a manner that is clearly not random (cf. Fig. 6). We think that the differences in climate model performances between graph types can be explained by mean geographical link length differences between graph types in connection with the conjecture that in our networks long-range links are statistically more susceptible to misplacement than short-range links. In Fig. 8(b) we give a numerical validation of this conjecture for our SR networks; an analogous validation for the ES network is unfeasible at this point as no variance-stabilizing transformation is known for the ES. A heuristic motivation of the conjecture is possible with the help of an analog from psychology, regarding the network nodes in Fig. 8(a) as grown-up humans, for which, owing to the developmental process of perceptual narrowing [74], it is easier to discriminate between familiar than between unfamiliar types of perceptual stimuli: Because nodes j, k are more similar to the close node i than to the distant node l, it is easier for i than for l to distinguish between j, k. Thus, in terms of links, j-l is more likely to be confused with k-l than i-j with i-k. The connection between node similarity and link dissimilarity that is implicit to this analog is established via the statistical dependence measures SR and ES and constitutes the core of the conjecture.
In simple terms, the SRn graph is harder to reproduce than the other simple graphs because its links are longer. We propose to account for this purely geometric-statistical effect by relating climate model performances to the respective SERN model performance which nicely quantifies the effect. And just as H(TRMM, SERN) represents a worst case performance scenario, the GCDs and GHDs between TRMM and TRMMb networks constitute optimal performance limits for CCLMb, ERAIb and STARS. In the sense of [76], relating climate model performances to these benchmarks yields a more informative measure of the actual climate model skills when comparing performances for different network types. We thus define a model skill score taking Why in our networks long-range links are statistically more susceptible to misplacement than short-range links. (a) Schematic of a simple graph where the color of a node represents the time series behind it with similarity between colors corresponding to similarity between time series. Note that just like the networks studied, this schematic graph features roughly constant nearest geographical neighbor distances and decreasing node similarity with increasing geographical node distance. (b) Mean absolute differences between Fisher transformed correlation coefficients r ij , r ik for geographically close nodes j, k versus d ij , the geographical distance between nodes i, j. Since the estimation errors of the Fisher transformed rank correlations are valueindependent [41,42], plot (b) shows that with increasing d ij , correlation differences r ij − r ik for geographically close nodes j, k become less robust with respect to the estimation errors of r ij , r ik . As r ij − r ik becomes less robust, so does the relative ranking of r ij , r ik and, hence, the thresholding of r ij , r ik to a ij , a ik [Eq. (21)]. Therefore, in the SRn and SRp graphs, the misplacement of a long-range link [j-l versus k-l in (a)] is more likely than the misplacement of a short-range link [i-j versus i-k in (a)]. To give an idea of link misplacement likelihoods, probabilities of false relative rankings of r ij , r ik at different σ levels are marked in (b). The σ levels were calculated using a BART estimator for the effective sample size M 0 < M to account for autocorrelations in the anomaly time series [75].  (11) for SRn/SRp/ESp, respectively. Like Fig. 6, this suggests that the SRn graph is hardest to model, even after taking the link length effect into account. As per Welch's t-test [77], the average SRp and ESp skill scores are not significantly different at the 5% α level. Compared to the untransformed GHDs, the skill scores display considerably increased similarity between network types which means that large parts of the performance differences between network types can be attributed to type-specific network uncertainties due to climate variability and spatial embedding. A model performance intercomparison for the individual network types is permitted as all differences between ensemble mean GCD or GHD values are significant at the 5% α level according to Welch's t-test. We find that STARS performs worse than ERAI for every network type considered. The best models are ERAI for the edge-weighted SR network and the SRp graph, and CCLM for the SRn and ESp graphs. The poor SR performance of CCLM is in line with its abovementioned overestimation of absolute correlation values [cf. Fig. 5(b)].
Local network differences. We now come to the application of the new local network difference measures. LCDs, LHDs and DNDs between TRMM and model precipitation networks are depicted in Fig. 9. We do not show the LHDs between ESp graphs because they are dominated by the degree dependence which motivated the introduction of the DND, nor the DNDs between SRn graphs since they are undefined in many locations due to a multitude of isolated nodes in these graphs.
Starting with the LCD, we find virtually no spatial variability in C i (TRMM, TRMMb), which is due to the use of the Fisher transformation in the LCD definition [Eq. (11)] if we consider differences between TRMM and TRMMb correlations to quantify estimation errors. The contrasting spatial dependence of climate model LCDs hints at model deficiencies with spatially inhomogeneous consequences. Higher correlation differences common to all models can be observed along the Andes, in northeastern Brazil and over the Pacific south of 20°S while lower values occur over the Atlantic south of the SACZ. We observe LCD contrasts along the coasts and find a larger LCD spread across models over land than over sea. The largest correlation distances to TRMM are produced by CCLM over central Brazil.
The LHDs of SRn and SRp simple graphs are clearly degree-dependent (cf. Fig. 4). We see that high LCDs may coincide with high LHDs as for instance over the Pacific off the Chilean coast for STARS or over central and northeastern Brazil for CCLM. Yet where correlations are not strong enough to be represented by links in the simple graphs, high LCDs may come along with low LHDs as can be observed along the Andes.
Hatching in Fig. 9(e-t) visualizes the respective relative local SERN model performances. Light hatching indicates insignificantly different local network differences at the 5% α level according to Welch's t-test [77] while heavy (no) hatching indicates a significantly better (worse) local performance by the random network model. In line with GHD results (cf. Fig. 7), the climate models perform better in most locations. Yet there are areas where mere knowledge of the observed all-domain link length distribution allows for a better neighborhood prediction than the use of a climate model. With the LHD, these areas differ considerably between models.  (10)]. Note that the LHDs have been renormalized by (N−1)ρ, analogously to the degrees (Fig. 4). The color scale next to (p) is also applicable to (q-t). Hatching in (e-l) puts LHDs in relation to H i (TRMM, SERN) of the respective graph type; same for DNDs in (m-t); light hatching indicates insignificantly different values at the 5% α level according to Welch's ttest [77]; heavy (no) hatching indicates a significantly better (worse) local performance by the random network model. doi:10.1371/journal.pone.0118088.g009

Local Network Differences and Dynamical System Model Evaluation
A more coherent picture only emerges with the DND, shown for SRp and ESp graphs in Figs. 9(m-p) and 9(q-t), respectively. We observe larger DNDs over land than over sea and find particularly high values in the Guyana Highlands and along the Andes. DNDs between TRMM and TRMMb are much larger for the ESp than for the SRp graph. Climatic uncertainties of extreme event synchronizations are most pronounced in the Amazon basin and along the eastern slopes of the Andes, which is consistent with the outstanding noisiness of TRMM's ESp degree field in those areas [cf. Fig. 4(m)]. In line with the respective GHDs (Fig. 7), the ESp DNDs to TRMM increase slightly from TRMMb to CCLMb to ERAIb to STARS while their spatial patterns do not vary much across datasets. This shows that most of the differences between TRMM and climate model ESp graphs can be explained with interannual climate variability.
In this context, note that, generally, the local network difference fields based on the original CCLM and ERAI time series do not differ much from the respective bootstrap ensemble mean fields. The latter are smoother and feature slightly larger values but apart from that show the same spatial patterns. Hence, where climate model bootstrap network differences to TRMM clearly exceed the corresponding difference between TRMM and TRMMb, network imperfections cannot be explained by climatic uncertainty but must be due to model deficiencies.

Discussion
With the present study we advance the complex network approach to climate model evaluation recently put forward by Feldhoff et al. [8]. We do so in three ways, going from the general to the particular. Firstly, we expose the wider scope of application of the approach outlining how it can be used to evaluate models of real-world multidimensional dynamical systems of any background; secondly, we define local versions of the global network difference measures introduced by Feldhoff et al. [8] to facilitate spatially explicit network comparisons in the climate model evaluation context; and thirdly, we apply both local and global network difference measures in a sample climate model evaluation where we highlight intricacies of the approach related to comparisons of model performances across network types.
The first local network difference measure we define is a local version of the global Hamming distance (GHD) between simple graphs. We demonstrate that the local Hamming distance (LHD) comes with a disadvantageous degree dependence, which we propose to overcome using a suitable statistical null model, relating the actual LHD value to the null model probability distribution of possible LHD values given the degrees of the respective node. This way a second local network difference measure is defined which we call the degree-conditional neighborhood dissimilarity (DND).
Our ansatz to render the LHD degree-independent via a statistical null model goes beyond existing ad hoc renormalization methods used by the vertex similarity community to solve the equivalent problem occurring when comparing different nodes of the same network [21]. As our ansatz has the virtue of an explicit probabilistic motivation and is easily transferred to the equivalent vertex similarity problem, we propose it to supersede the ad hoc approaches common in that field.
By definition, the relationship between network LHDs and DNDs is similar to the one between absolute and relative precipitation biases. Large LHDs can coincide with small DNDs and vice versa. The two network difference measures should therefore always be considered conjointly to prevent misinterpretations. Notably, while zero LHDs imply zero DNDs, the converse is not true. Rather, a vanishing DND indicates the greatest possible neighborhood overlap.
We also examine the case of edge-weighted graphs represented by matrices of statistical relationship coefficients. It is argued that difference measures for such graphs should account for estimation uncertainties inherent to the coefficients. For this purpose we propose to employ variance-stabilizing transformations like the Fisher transformation in the case of correlations. Using the latter we define the local and the global correlation distance (LCD, GCD) between correlation matrices in formal analogy to the LHD and the GHD between simple graphs.
Generalizations of our local network difference measures to node-weighted graphs are presented as well. Computing LHDs and DNDs between directed graphs is straightforwardly done by separating ingoing and outgoing links, i.e. by defining H AE i and in turn D AE i via k AAE i ; k BAE i . The new difference measures are therefore highly versatile and can be used to compare networks of any functional or structural background as long as these have a common set of nodes. They might be useful additional tools for investigations of network evolution, such as in [73,78], or for studying the impact of disorders and disasters on network topologies, similar to [79][80][81][82].
In a sample climate model evaluation we apply the network difference measures to comparatively analyze statistical and dynamical regional climate simulations of the South American monsoon system. We focus on climate networks constructed from DJF precipitation spacetime series and use satellite data provided by the Tropical Rainfall Measuring Mission (TRMM) 3B42 V7 as a reference. We also include the ERA-Interim (ERAI) reanalysis data in the evaluation since they where used to drive the simulations of both the statistical model STARS and the dynamical model CCLM.
Different types of networks are constructed to capture distinct aspects of the spatiotemporal precipitation dynamics. Based on spatial rank correlations between anomaly time series we define an edge-weighted graph called SR and two simple graphs with links representing the 2% most positive and the 2% most negative correlations called SRp and SRn, respectively. Another simple graph called ESp is based on the 2% strongest spatial synchronizations between extreme event time series. We show that the ESp and SRp graphs are dominated by short-range links while teleconnections prevail in the SRn graph. Visualizing these teleconnections using an RGB color model we reveal that they represent the two major rainfall dipoles of the South American monsoon system.
Global distances between TRMM and ERAI/CCLM/STARS simple graphs vary strongly across network types, which allows us to uncover how climate variability and spatial embedding effects pose limits to the fidelity of network reproduction. It is shown that as geographical link lengths increase, so does the likelihood of their misplacement, which explains why GHDs are much larger for the SRn than for the SRp and ESp graphs. We employ spatially embedded random networks (SERNs) to model this effect. The fraction of network differences due to interannual climate variability is quantified using a bootstrap approach. In agreement with the fickle nature of extreme events, we find the ESp graph most affected by this source of uncertainty. Based on SERN and bootstrap network differences to TRMM, a model skill score is defined that accounts for the named uncertainties. According to this score, the SRp and ESp graphs are reproduced with similar and higher fidelity than the SRn graph. This result makes sense inasmuch as more than the former graphs, the latter contains nontrivial information about the system dynamics.
Comparing performances across reanalysis and climate models, we find that CCLM beats ERAI at two out of four network types, while STARS performs worse than ERAI in all cases. As both CCLM and STARS were driven by ERAI data, this suggests that dynamical downscaling may retain value at the resolution of the driving data while statistical resampling can only impair data quality. Interestingly, for the SRp graph, our evaluation against the independent TRMM data yields a relative performance ranking of CCLM versus STARS that is opposite to the one obtained in an evaluation against the driving reanalysis data [8].
From the ensemble spread of bootstrap network distances to TRMM we infer that ERAI and even more so CCLM and STARS underestimate the interannual variability of the precipitation dynamics encoded in our networks. As for the reanalysis, this finding complements documented deficiencies in reproducing the variability of spatiotemporal totals of tropical rainfall over land [83,84]. The loss of large-scale variability in the CCLM simulations might have been prevented by spectral nudging [85,86]. In case of STARS we conclude that its resampling algorithm tends to reduce the variability present in its input data. We suspect deficiencies in cloud physics parameterizations of contributing substantially to the variability underestimations by ERAI and CCLM as simulated precipitation characteristics over monsoon season South America are highly sensitive to modifications of these parameterizations [55] while they are known to lack flexibility in reacting to variations in the large-scale environment [83,87]. Another factor are model parameters that in reality vary from year to year but are represented by climatological values in ERAI [46,88] and CCLM [89,90], like aerosol concentrations, forest albedos or leaf area indices, which means that the models are unable to account for interannual variabilities in aerosol or land-atmosphere feedbacks due to, e.g. volcanic eruptions, bushfires or droughts.
Finally, our new local difference measures are employed to study the spatial distribution of network dissimilarities. In doing so, the DND proves to be most useful to identify commonalities across datasets. Larger DNDs over land than over sea are consistent with land surfaceatmosphere interactions being more complex than sea surface-atmosphere interactions, and relatively high DNDs in the Guyana highlands and along the Andes confirm model deficiencies in simulating precipitation over complex terrain [91,92]. Using the DND we also reveal that for the given observational record length, differences between ESp graphs are dominated by the interannual variability of extreme events, which prevents a reasonable evaluation of their spatial synchronizations at this point.
The situation is different for the correlation networks and we find model-specific spatial patterns in the LHD and LCD fields. The largest SR, SRn and SRp network distances of CCLM to TRMM are found over central and northeastern Brazil. We suppose these biases to be associated with a less severe underestimation of precipitation over this region relative to over the western Amazon basin, and with an erroneous northward displacement of the SACZ [55]. For STARS, the largest correlation network distances to TRMM occur over the Pacific off the Chilean coast. During DJF this region is controlled by a subtropical high that is occasionally disturbed by troughs carried along by the Westerlies of the Southern Hemisphere. Considering that STARS' simulation domain in fact extents to the southern tip of South America [8], we suppose that the statistical model inadequately resamples the synoptic features that govern the austral summer climate over the southern Pacific Horse Latitudes. A common deficiency of ERAI, CCLM and STARS is an underestimation of the relative strength of the maritime part of the SACZ precipitation seesaw. As both the reanalysis and the dynamical model employ prescribed sea surface temperatures [46,55], this might be due to an imperfect ocean-atmosphere coupling [93].
Supporting Information S1 Data. TRMM 3B42 V7 daily precipitation estimates over South America from 1998 to 2011, conservatively interpolated to the ERA-Interim grid.