Statistic Complexity: Combining Kolmogorov Complexity with an Ensemble Approach

Background The evaluation of the complexity of an observed object is an old but outstanding problem. In this paper we address this problem by introducing a measure called statistic complexity. Methodology/Principal Findings This complexity measure differs from all other measures in the following respects. First, it is a bivariate measure that compares two objects, corresponding to pattern generating processes, on the basis of their normalized compression distance. Second, it quantifies the error that could be incurred by comparing samples of finite size from the underlying processes. Hence, the statistic complexity provides a statistical quantification of the statement 'X is similarly complex as Y'. Conclusions The presented approach ultimately transfers the classic problem of assessing the complexity of an object into the realm of statistics. This may open up this complexity measure to diverse application areas.


Introduction
Complex systems research studies how interactions among simple building blocks give rise to collective behavior or properties that are absent in the elementary components of the system. Because this problem does not fit into any one of the traditional research fields, it is connected to several of them, for instance physics, biology, chemistry and econometrics [1-5]. Many measures, properties and characteristics of a multitude of complex systems from these fields have been studied to date [6-8]; however, the complexity of an object may have received the most attention. This property of complex systems has fascinated generations of scientists [9-11] trying to quantify such a notion. Very coarsely speaking, an object is said to be 'complex' when it does not match patterns regarded as simple, as LÓPEZ-RUIZ et al. [12] describe it in their article. Over the last decades, many approaches have been suggested to define the complexity of an object quantitatively [9,11,13-19]. An intrinsic problem with such a measure is that there are various ways to perceive and, hence, characterize complexity, leading to complementary complexity measures [20]. For example, Kolmogorov complexity [9,11,21] is based on algorithmic information theory and considers objects as individual symbol strings, whereas effective measure complexity (EMC) [17], excess entropy [22], predictive information [23] and thermodynamic depth [18] relate objects to random variables and are ensemble based. Interestingly, despite considerable differences among all these complexity measures $M$, they have in common that they assign a complexity value $C_M(x')$ to each individual object $x'$ under consideration. In this paper we assume that $x'$ corresponds to a string of a certain length whose components assume values from a certain domain, e.g., $A = \{0, 1\}$ or $A = [0, 1]$.
It is important to note that VITÁNYI et al. recently introduced a conceptually different measure that evaluates the complexity distance between two objects $x'$ and $x''$ instead of their absolute complexity values. This measure, the normalized compression distance (NCD) [24], $\mathrm{NCD}(x', x'')$, is based on Kolmogorov complexity [10].
The purpose of this paper is to introduce a new measure of complexity, which we call statistic complexity, that is not only different from all other complexity measures introduced so far, but also connects directly to statistics, specifically to statistical inference [25,26]. More precisely, we introduce a complexity measure with the following properties. First, the measure is bivariate, comparing two objects, corresponding to pattern generating processes, on the basis of their normalized compression distance. Second, the measure quantifies the error that could be incurred by comparing samples of finite size from the underlying processes. Hence, the statistic complexity provides a statistical quantification of the statement 'X is similarly complex as Y'. This paper is organized as follows. In the next section we describe the general problem in more detail and introduce our complexity measure. Then we present numerical results and provide a discussion. We finish with conclusions and an outlook.

Methods
Currently, a commonly acknowledged, rigorous mathematical definition of the complexity of an object is not available. Instead, when complexity measures are suggested they are normally assessed by their behavior with respect to three qualitative patterns, namely simple, random (chaotic) and complex patterns. Qualitatively, a complexity measure is considered good if: (1) the complexity of simple and random objects is less than the complexity of complex objects [17], and (2) the complexity of an object does not change if the system size changes. For example, Kolmogorov complexity has the desirable property that it remains unchanged if the system size doubles, i.e., $C_K(x) = C_K(xx)$; however, it cannot distinguish random from complex patterns because in both cases the compressibility of an object is low, resulting in high values of $C_K$. We want to add a third property to the above criteria: (3) a complexity measure should quantify the uncertainty of the complexity value. As motivation for this property we note that there is a crucial difference between an observed object $x'$ and its generating process $X$ [23]. If the complexity of $X$ is assessed based on the observation $x'$ only, this assessment may be erroneous. The error may stem from the limited (finite) size of the observations. Measurement errors are another source that impairs an error-free assessment. For this reason, the major objective of this article is to introduce a complexity measure that possesses all three properties listed above and assesses the complexity classes of the underlying processes instead of individual objects.
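Property (2) and the limitation noted for Kolmogorov complexity can be illustrated with a real compressor standing in for the uncomputable $C_K$. The following is a minimal sketch, assuming zlib as the compressor and two hypothetical test strings that are not taken from the paper; a compressed size only upper-bounds Kolmogorov complexity, so the numbers are indicative at best.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (a computable stand-in for C_K)."""
    return len(zlib.compress(data, 9))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
simple = b"01" * 500  # hypothetical periodic (simple) pattern
noise = bytes(rng.getrandbits(8) for _ in range(1000))  # hypothetical random pattern

# Compare the compressed size of x with that of the doubled string xx.
for name, x in (("simple", simple), ("random", noise)):
    print(name, compressed_size(x), compressed_size(x + x))
```

Under these assumptions the compressed size of xx should stay close to that of x for both strings, while the random string receives a far larger value than the periodic one. A high value alone therefore does not tell random and complex patterns apart.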
We start by pointing out that criterion (1) provides a relative statement connecting different objects. That means the complexity of an object is always related to the complexity of another object [20], leading to relative statements like 'X is similarly complex as Y'. Hence, a numerical value $C(X)$ without knowledge of the complexity value of any other object has no meaning at all. For reasons of mathematical rigor, we propose to include this implicit reference point in a proper definition of complexity. This implies that a fundamental complexity measure needs to be bivariate, $C(X, Y)$, comparing two processes $X$ and $Y$, instead of univariate. As a side note, we remark that all complexity measures suggested so far that we are aware of are univariate measures [13,14,16-18,22,23] with respect to the context set above, except for the normalized compression distance (NCD) [24,27]. However, a practical problem of the NCD is that Kolmogorov complexity, on which it is based, is not computable but only upper semicomputable [27]. LI et al. introduced in [27] a normalized and universal metric called normalized information distance (NID), which can be approximated by the normalized compression distance [27], $\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}$. Here, $C(x)$ denotes the compression size of string $x$ and $C(xy)$ the compression size of the concatenated strings $x$ and $y$. In practice, the quantities $C(\cdot)$ are obtained with compressors like gzip or bzip2; see [28,29] for details.
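The NCD can be approximated with an off-the-shelf compressor. The following is a minimal sketch, assuming zlib (the DEFLATE algorithm that gzip also uses) as the compressor $C$ and the standard definition NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}; the function names and the test strings are hypothetical, and other compressors will give somewhat different values.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """C(.): length in bytes of the zlib-compressed input."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # compression size of the concatenation xy
    return (cxy - min(cx, cy)) / max(cx, cy)


rng = random.Random(42)  # fixed seed for reproducibility
a = bytes(rng.getrandbits(8) for _ in range(2000))
b = bytes(rng.getrandbits(8) for _ in range(2000))
print("NCD(a, a) =", ncd(a, a))  # identical objects: expected to be close to 0
print("NCD(a, b) =", ncd(a, b))  # unrelated objects: expected to be close to 1
```

Because a real compressor is not an ideal one, the value for identical objects is small but not exactly zero, and values slightly above 1 can occur.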
Criterion (3) of a complexity measure stated above acknowledges the fact that an assessment of an object's complexity cannot be without uncertainty or error if only finite information about this object is available. That means, for a complexity measure to be applicable to real objects (rather than purely mathematical ones) it has to be statistic in order to deal appropriately with incomplete information. Based on these considerations, the statistic complexity measure we suggest is defined by the following procedure, visualized in Fig. 1:
1. Estimate the empirical distribution function $\hat{F}_{X,X}$ of the normalized compression distance from $n_1$ samples, from objects $x'$ and $x''$ of size $m$ generated by process $X$. (We indicate estimated entities by $\hat{F}$ and refer to the ensemble by $F$. Here $x \sim X$ means that $x$ is generated (or drawn) from process (distribution) $X$.)
2. Estimate the empirical distribution function $\hat{F}_{X,Y}$ of the normalized compression distance from $n_2$ samples, from objects $x'$ and $y'$ of size $m$ from two different processes, $X$ and $Y$.
3. Determine the test statistic $T = \sup_x |\hat{F}_{X,X}(x) - \hat{F}_{X,Y}(x)|$ and the corresponding p-value $p$.
4. Define $C_S = p$ as statistic complexity.
This procedure corresponds to a two-sided, two-sample Kolmogorov-Smirnov (KS) test [30,31] based on the normalized compression distance [24,27], which provides distances among observed objects. The statistic complexity corresponds to the p-value of the underlying null hypothesis, $H_0: F_{X,X} = F_{X,Y}$, and, hence, assumes values in $[0, 1]$. The null hypothesis is a statement about the null distribution of the test statistic $T = \sup_x |\hat{F}_{X,X}(x) - \hat{F}_{X,Y}(x)|$, and because the distribution functions are based on the normalized compression distances among objects $x'$ and $x''$, drawn from the processes $X$ and $Y$, this leads to a statement about the distribution of normalized compression distances. Hence, verbally, $H_0$ can be phrased as 'on average, the compression distance of objects from X to objects from Y equals the compression distance of objects only taken from X'. It is important to emphasize that this equality holds on average and, thus, needs to be connected to two ensembles $X$ and $Y$. If the alternative hypothesis, $H_1: F_{X,X} \neq F_{X,Y}$, is true, this equality no longer holds, implying differences in the underlying processes $X$ and $Y$ that lead to differences in the NCDs. From the formulation of the hypotheses tested by the statistic complexity it is apparent that we are closely following the guiding principle expressed by LÓPEZ-RUIZ et al. [12], as cited at the beginning of this paper, because $C_S$ is intrinsically a comparative measure. As a side note regarding the choice of the null hypothesis, we remark that substituting $F_{X,Y}$ with $F_{Y,Y}$ may encounter problems in cases where the complexity value of objects in $Y$ is systematically shifted compared to the complexity value of objects in $X$. In this case, the distributions $F_{X,X}$ and $F_{Y,Y}$ could be similar although the complexities of the elements in $X$ and $Y$ are different.
This may correspond to a pathological case rarely encountered in practice; conceptually, however, such a null hypothesis is apparently less stringent.
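The procedure can be sketched end to end in a few lines. The following is a minimal sketch under several assumptions that are not from the paper: zlib stands in for the compressor, the p-value uses the standard asymptotic Kolmogorov series with a small-sample correction (the paper does not state how its p-values were computed), and the two processes `periodic` and `coin_flips` are hypothetical toy generators.

```python
import math
import random
import zlib


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, with zlib standing in for C()."""
    cx, cy = len(zlib.compress(x, 9)), len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)


def ks_two_sample(a, b):
    """Two-sided two-sample KS test; returns (T, asymptotic p-value)."""
    a, b = sorted(a), sorted(b)
    n1, n2 = len(a), len(b)
    i = j = 0
    t = 0.0
    while i < n1 and j < n2:
        v = min(a[i], b[j])
        while i < n1 and a[i] <= v:  # advance past ties so both ECDFs are evaluated at v
            i += 1
        while j < n2 and b[j] <= v:
            j += 1
        t = max(t, abs(i / n1 - j / n2))  # |F_hat_1(v) - F_hat_2(v)|
    n_eff = n1 * n2 / (n1 + n2)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * t
    if lam < 0.2:  # the series converges slowly here and the p-value is ~1
        return t, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                  for k in range(1, 101))
    return t, min(1.0, max(0.0, p))


def statistic_complexity(draw_x, draw_y, n1, n2, rng):
    """p-value of H0: F_{X,X} = F_{X,Y}, estimated from n1 and n2 NCD samples."""
    s_xx = [ncd(draw_x(rng), draw_x(rng)) for _ in range(n1)]  # x', x'' ~ X
    s_xy = [ncd(draw_x(rng), draw_y(rng)) for _ in range(n2)]  # x' ~ X, y' ~ Y
    return ks_two_sample(s_xx, s_xy)[1]


def periodic(rng, m=1000):
    """Hypothetical process X: a period-4 pattern with a random phase."""
    shift = rng.randrange(4)
    return ("0110" * (m // 4 + 1))[shift:shift + m].encode("ascii")


def coin_flips(rng, m=1000):
    """Hypothetical process Y: independent fair coin flips."""
    return "".join(rng.choice("01") for _ in range(m)).encode("ascii")


rng = random.Random(0)
print("C_S(periodic, coin_flips) =", statistic_complexity(periodic, coin_flips, 20, 20, rng))
print("C_S(periodic, periodic)   =", statistic_complexity(periodic, periodic, 20, 20, rng))
```

For the two clearly different toy processes the p-value should be very small. For a process compared with itself the null hypothesis holds, so the p-value carries no evidence of a difference and its exact value varies from run to run.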
Regarding the notation and interpretation of the above procedure, it is important to note the following. First, the entities $x$ and $y$ refer to values of the NCD. For example, $x = \mathrm{NCD}(x', x'')$, where $x'$ and $x''$ are observable objects that are independently and identically distributed (iid), generated from a process $X$, $x', x'' \sim X$. Because $x'$ and $x''$ are generated from the same process $X$, the resulting distribution function $F_{X,X}$ is indexed by this process only. The $y$ entities are obtained similarly; however, in this case $x'$ and $y'$ are objects generated from two different processes, namely $x' \sim X$ and $y' \sim Y$. For this reason the distribution function is indexed by these two processes, $F_{X,Y}$. Second, we use the notation $x' \sim X$ to indicate that $x'$ is generated from a process $X$, but also that $x'$ is drawn from $X$. The first meaning is clear when thinking of $X$ as a model for a complex system, e.g., a cellular automaton or a stochastic process. The latter emphasizes the fact that such a process, even if deterministic, becomes random with respect to, e.g., random initial conditions and, hence, effectively is a stochastic process. Third, for reasons of conceptual simplicity we require all objects to have the same size $m$. This condition may be relaxed to allow objects of varying sizes, but this may require additional technical consideration. On a technical note, the statistic complexity defined above has the very desirable property that the power of the test asymptotically reaches 1 for $n_1 \to \infty$ and $n_2 \to \infty$ [32]. This means that, for infinitely many observations, the error of falsely accepting the null hypothesis when in fact the alternative is true becomes zero. This limiting property is important because in this case all information about the system is available and, hence, it would be implausible if no error-free decision could be achieved under such circumstances. Formally, this property can be stated as $p \to 0$ for $n_1 \to \infty$ and $n_2 \to \infty$ whenever the alternative hypothesis is true.
Finally, we would like to note that although statistic complexity is a statistical test, it borrows part of its strength from the NCD and, in turn, from the Kolmogorov complexity on which the NCD is based. Hence, it unites various properties from very different concepts.

Results
In the following we provide numerical examples for data frequently used when studying complexity measures. This allows a direct comparison of our measure with other measures.
The first characteristic of the statistic complexity we study is the influence of the size $m$ of objects on $C_S$. Table 1 shows the results for comparing patterns generated by different rules of one-dimensional cellular automata (CA). Column one represents the reference process, $X$, and column two corresponds to $Y$. The third and fifth columns show the averaged p-values obtained for cellular automata of length $T = 100$ and $T = 200$, respectively; columns four and six provide the variances of the corresponding p-values. For the simulation results shown in Table 1 we generated spatiotemporal patterns for one-dimensional CA with $N = 50$ (space) and $T$ (time), an alphabet of size $k = 2$ and an $r = 1$ neighborhood with periodic boundary conditions. As burn-in time we used $t_{\mathrm{trans}} = 1000$ time steps. Each of these spatiotemporal patterns $S_{ij}$, with $i \in \{1, \ldots, T\}$ and $j \in \{1, \ldots, N\}$, is transformed to its difference [...] (Here [...] corresponds to a row vector of length $N$.) resulting in a string (object) of length $m = NT$ to which the NCD can be applied. Here, the operator $+$ means concatenation of strings. See [29] for numerical details on the application of the NCD. The results in Table 1 show that the p-values remain of the same order of magnitude if the size $m$ of an object is doubled, meaning that the overall quantitative assessment of two processes $X$ and $Y$ by the measure $C_S$, based on objects sampled from them, is invariant to extensions of the size $m$.
Table 1. Results for one-dimensional CA ($t_{\mathrm{trans}} = 1000$, $N = 50$, $T = 100$ (third and fourth column) and $T = 200$ (fifth and sixth column)) averaged over 10 runs.
Next we demonstrate that the statistic complexity is capable of differentiating between random and complex objects. For this purpose we compare rule 30, which produces random patterns, with rules 90 and 225, both random, and with rule 110, which is complex because it is capable of universal computation. From Table 1 [...]. In addition we compare rule 30 with rules 73, 54 and 22, classified as random according to Wolfram, and obtain very low p-values, suggesting significant differences among those patterns. The crucial point here is that not all CA rules that produce chaotic patterns are indistinguishable from each other. In [33] the growth exponent of the roughness, along with other measures, has been used to obtain several subclasses of CA rules leading to chaotic behavior. Comparing our results with that classification reveals that rules 73, 54 and 22 are indeed in different subclasses, whereas rule 30 is classified together with rules 90 and 225. Last, we compare rule 30 with a periodic pattern, rule 33, and also obtain a clear distinction in this case. In summary, $C_S$ can not only distinguish between simple and complex patterns but also finds meaningful substructures among chaotic patterns if rule 30 is used as reference process.
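The CA setup described above ($N = 50$ cells, alphabet size $k = 2$, $r = 1$ neighborhood, periodic boundary conditions, a burn-in of 1000 steps) can be reproduced with a short simulation. The following is a minimal sketch with hypothetical function names; it uses the standard Wolfram rule numbering and concatenates the rows of the raw pattern into a string of length $m = NT$. The difference transformation mentioned in the text is not recoverable from this version of the paper and is therefore omitted here.

```python
import random


def eca_step(state, rule):
    """One synchronous update of an elementary CA with periodic boundaries."""
    n = len(state)
    return [
        # The neighborhood (left, center, right) indexes a bit of the rule number.
        (rule >> (4 * state[(i - 1) % n] + 2 * state[i] + state[(i + 1) % n])) & 1
        for i in range(n)
    ]


def eca_pattern(rule, n_cells=50, t_steps=100, t_trans=1000, rng=None):
    """Spatiotemporal pattern (t_steps rows of n_cells cells) after a burn-in."""
    rng = rng or random.Random()
    state = [rng.randint(0, 1) for _ in range(n_cells)]  # random initial condition
    for _ in range(t_trans):  # discard the transient
        state = eca_step(state, rule)
    rows = []
    for _ in range(t_steps):
        rows.append(state)
        state = eca_step(state, rule)
    return rows


def pattern_to_bytes(rows):
    """Concatenate all rows into one string (object) of length m = N * T."""
    return "".join("".join(map(str, row)) for row in rows).encode("ascii")


obj = pattern_to_bytes(eca_pattern(30, rng=random.Random(1)))
print(len(obj))  # m = N * T
```

Objects produced this way for two rules can be passed to any NCD implementation to build the samples of NCD values for the two ensembles.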
Next, we apply our measure to the logistic map and compare the results with the Lyapunov exponent ($\lambda$). The results are summarized in Fig. 2. We calculate the time series for various values of $r$ (x-axis) in the interval $[3.8, 3.9]$ ($r$ was varied in steps of 0.001 and the sample size was $n_1 = n_2 = 6$). $\lambda$ assumes negative values in $[3.829, 3.849]$ and $[3.856, 3.856]$, indicating nonchaotic behavior of the logistic map for these values of $r$. The vertical dashed line separates positive from negative values. The p-values of the statistic complexity (blue line, cross symbols) are obtained for each value of $r$ by averaging over 50 time series, each of length 1000 (after a transient period of 1000 steps). As reference process, $X$, we use a logistic map with $r_{\mathrm{ref}} = 3.451$, which corresponds to periodic behavior. From Fig. 2 one can see that there are essentially two types of p-values, ones that are not zero and ones that are close to zero. For example, using a significance level of 0.01 (dotted horizontal line) one obtains that significant values correspond to positive Lyapunov exponents and nonsignificant values to negative Lyapunov exponents. Again, we want to emphasize that the p-values do not provide a yes or no answer as to whether the logistic map, for a given value of $r$, is chaotic or nonchaotic; the correct interpretation is that low p-values provide strong evidence against the null hypothesis, whereas high p-values do not allow the null hypothesis to be rejected. Because we use $r_{\mathrm{ref}} = 3.451$ as reference, for which the logistic map shows periodic (nonchaotic) behavior, this is a similar though not identical question. The results for the logistic map allow a comparison with a well studied system. As demonstrated by our results shown in Fig. 2, for an appropriately chosen reference process, $X(r_{\mathrm{ref}})$, there is a clear correspondence between the statistic complexity and the Lyapunov exponent.
This property is certainly desirable because it may allow the measure to be connected to traditional contributions in the field beyond the logistic map. The possibility of such a connection, despite the seemingly different methods underlying the statistic complexity and the Lyapunov exponent, can be attributed to the parametric form of our complexity measure, which allows a flexibility that is entirely missing in other measures. More importantly, this flexibility is not imposed on the measure but follows naturally from a consistent interpretation of complexity as a referential measure [12], which necessarily implies the existence of a reference process $X$ against which another process $Y$ is quantitatively compared.
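The Lyapunov exponent used for comparison can be estimated directly from a simulated time series. The following is a minimal sketch, assuming the standard logistic map $x_{t+1} = r x_t (1 - x_t)$ and the usual estimator that averages $\ln|r(1 - 2x_t)|$ along the trajectory; the function names and the initial condition `x0 = 0.3` are hypothetical, while the series length and the transient of 1000 steps follow the text.

```python
import math


def logistic_series(r, length=1000, t_trans=1000, x0=0.3):
    """Time series of the logistic map after discarding a transient."""
    x = x0
    for _ in range(t_trans):
        x = r * x * (1.0 - x)
    series = []
    for _ in range(length):
        series.append(x)
        x = r * x * (1.0 - x)
    return series


def lyapunov_exponent(r, length=1000, t_trans=1000, x0=0.3):
    """Average of ln|f'(x_t)| along the orbit, with f'(x) = r * (1 - 2x)."""
    xs = logistic_series(r, length, t_trans, x0)
    # The tiny floor guards against log(0) if the orbit hits x = 0.5 exactly.
    return sum(math.log(max(abs(r * (1.0 - 2.0 * x)), 1e-300)) for x in xs) / len(xs)


# Reference value from the text (periodic) and the two ends of the scanned interval.
for r in (3.451, 3.8, 3.9):
    print(r, lyapunov_exponent(r))
```

A negative estimate indicates periodic behavior and a positive one chaotic behavior. Estimates close to a bifurcation point converge slowly, so the finite transient and series length limit their accuracy there.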

Discussion
The complexity measure introduced in this paper has several properties that distinguish it from all other measures proposed so far. First, $C_S$ is a bivariate measure that allows comparative statements instead of absolute ones. This may at first appear to be a disadvantage; however, as LÓPEZ-RUIZ et al. [12] point out, we inevitably compare patterns with each other to make a decision about their complexity (see also the comparative discussion on page 909 in [17] about the three patterns shown in Fig. 1) [20]. Second, we do not make assumptions with respect to the size of the patterns to which our measure can be applied; instead, in principle, we allow patterns of any finite or infinite size $m$. In contrast, measures like EMC or excess entropy are based on block entropies of varying order $n$, and the final measure is obtained in the limit as $n$ tends to infinity. Strictly, such measures require an infinite amount of data. Third, because statistic complexity allows the comparison of patterns of any size $m$ with finite sample sizes $n_1$ and $n_2$, the result of the comparison may be erroneous. The KS test underlying $C_S$ allows this error to be quantified statistically. Because the error can be quantified in dependence on $m$, $n_1$ and $n_2$, there is no need to assume limiting properties. At this point we would like to re-emphasize that the term statistic complexity has been chosen to underline the involvement of a test statistic on which the complexity value is based. For this reason, other complexity measures that have been named statistical complexity [12,34,35] are not similar to our measure at all, because none of them uses a test statistic or a statistical test. Hence, they are not actually related to statistics (the field). An alternative name for these measures would be probabilistic complexity, which would make this difference more obvious. The fourth point relates to the empirical distribution functions.
Besides providing the connection to the KS test, they allow the introduction of two ensembles, one for the process $X$ and one for the process $Y$. These ensembles compensate for the fact that the classic Kolmogorov complexity is not related to any ensemble but only to one string. Further, the ensembles induce a probabilistic interpretation of the deterministic NCD with respect to the underlying processes that generate the patterns. This is in accordance with [17], which emphasizes the importance of complexity measures being probabilistic. Taken together, this allows a quantifiable approximation, in dependence on $m$, $n_1$ and $n_2$, of the underlying processes $X$ and $Y$ with respect to the information they provide about their complexity in the form of the actually observable patterns.
From an applied point of view, the direct connection of statistic complexity with statistical inference allows a confirmatory analysis of the complexity of objects. Because the uncertainty of a complexity comparison is inherently provided by our measure, it is applicable to (real) objects from a multitude of different application domains. In the future we are planning to investigate the complexity of biological pathways in the context of cancer and other complex diseases [37]. A further potential direction would be an analysis of different goodness-of-fit tests. For example, it would be interesting to study a Cramér-von Mises or an Anderson-Darling test instead of a Kolmogorov-Smirnov test [36]. Other tests may have advantages in different application areas or under specific experimental conditions, although a Kolmogorov-Smirnov test was sufficient for the applications studied in this paper.