
Equivalence of information production and generalised entropies in complex processes

  • Rudolf Hanel ,

    Contributed equally to this work with: Rudolf Hanel, Stefan Thurner

    Roles Conceptualization, Formal analysis, Methodology, Writing – original draft, Writing – review & editing

    rudolf.hanel@meduniwien.ac.at

    Affiliations Section for Science of Complex Systems, CeMDS, Medical University of Vienna, Vienna, Austria, Complexity Science Hub Vienna, Vienna, Austria

  • Stefan Thurner

    Contributed equally to this work with: Rudolf Hanel, Stefan Thurner

    Roles Funding acquisition, Methodology, Writing – review & editing

Affiliations Section for Science of Complex Systems, CeMDS, Medical University of Vienna, Vienna, Austria, Complexity Science Hub Vienna, Vienna, Austria, Santa Fe Institute, Santa Fe, NM, United States of America

Abstract

Complex systems with strong correlations and fat-tailed distribution functions have been argued to be incompatible with the Boltzmann-Gibbs entropy framework, and alternatives, so-called generalised entropies, were proposed and studied. Here we show that this perceived incompatibility is actually a misconception. For a broad class of processes, Boltzmann entropy –the log multiplicity– remains the valid entropy concept. However, for non-i.i.d. processes, Boltzmann entropy is not of Shannon form, −k ∑i pi log pi, but takes the shape of generalised entropies. We derive this result for all processes that can be mapped reversibly and asymptotically to adjoint representations in which the processes are i.i.d. In these representations the information production is given by the Shannon entropy. Over the original sampling space this yields functionals identical to generalised entropies. The problem of constructing adequate context-sensitive entropy functionals can therefore be translated into the much simpler problem of finding adjoint representations. The method provides a comprehensive framework for a statistical physics of strongly correlated systems and complex processes.

Introduction

To quantify the information content of a process, a system, a source, a signal, or a sequence, one uses entropy. If systems or processes are independent and identically distributed (i.i.d.), ergodic, and stationary in their probabilities, it is known what to do: one uses the expression [1],

H(p) = − ∑i pi log pi ,  (1)

where i = 1, 2, …, W are the states the system can take and pi is the probability to observe them. The so-called Shannon entropy is given by S(p) = kH(p), where k > 0 is a constant that specifies the units of entropy; in statistical physics k = kB is the Boltzmann constant, and if we measure information in bits per symbol, k = 1/log 2. If the system or process of interest is not i.i.d., ergodic, or in stationary equilibrium, it becomes less clear how to obtain its correct information content. In principle there are two conceptually very different paths to solve the problem.
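As a minimal numerical illustration of Eq (1) –our sketch, not part of the original text– the following Python snippet evaluates H(p) with k = 1/log 2, i.e., in bits per symbol:

    import numpy as np

    def shannon_entropy(p, k=1.0):
        """H(p) = -k * sum_i p_i log p_i; with k = 1/log(2) the result is in bits per symbol."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                               # the convention 0 log 0 = 0
        return -k * np.sum(p * np.log(p))

    p = [0.5, 0.25, 0.125, 0.125]                  # example marginal distribution over W = 4 states
    print(shannon_entropy(p, k=1 / np.log(2)))     # 1.75 bits per symbol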

The first way (which we present in this paper) is to look at the information production of the process or system. Consider a process, X, that emits signals x(T) of length T, with x(T) = xT xT−1 ⋯ x1. Such an ordered list/sequence of elements x(T) is often referred to as a T-tuple (or T-gram). Every symbol, xt, in the signal is an element of a fixed sample space or alphabet, Ω, that contains all possible states xt can take. For example, if you think of X as a text-producing author, the states are the letters of the English alphabet, or of a "binary alphabet" of zeros and ones once the text is stored on a computer. In non-i.i.d. systems, symbols within sequences will in general be correlated in one way or another. Those correlations –which may extend over many different scales in the system– carry information about the system. Using the marginal distribution functions of the occurrences of states (letters) in Eq (1) will then certainly not provide the correct information content of the process. However, if we know the probability distribution, pT, to observe entire sequences, x(T), we compute the information production of the process X as

I(X) = limT→∞ kT H(pT) , with kT = k/T ,  (2)

see also Text 2 in S1 File for details. This changes the perspective from individual symbols, events, or states to entire sequences, or paths. From an information theoretic point of view, I(X) measures the average number of bits required to reversibly encode samples of X into bit-streams that can be sent through information channels, and it measures information production in bits per emitted symbol, i.e., for k = 1/log(2). Note that kH(pT) measures bits per T-tuple, i.e., per path segment of length T. Therefore, by using kT = k/T, one again measures bits per emitted symbol of the original alphabet. However, the number of all possible T-tuples –the size of the new "alphabet"– is enormous if T is large. If the sample space contains A symbols, the size of the alphabet of all paths is A^T. Summing over all of these states, and knowing their probabilities, pT, is in general impossible.
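The following sketch (again ours, with the naive plug-in estimator as a stand-in for Eq (2)) estimates the information production from a finite sample by building the histogram of T-tuples and computing kT H(pT); the exponential growth of the T-tuple alphabet is exactly why this direct estimate breaks down for large T.

    import numpy as np
    from collections import Counter

    def block_entropy_rate(x, T, k=1 / np.log(2)):
        """Estimate (k/T) * H(p_T) from the empirical distribution of T-tuples of the sequence x."""
        tuples = Counter(tuple(x[t:t + T]) for t in range(len(x) - T + 1))
        p_T = np.array(list(tuples.values()), dtype=float)
        p_T /= p_T.sum()
        return k * (-np.sum(p_T * np.log(p_T))) / T    # bits per emitted symbol of the original alphabet

    rng = np.random.default_rng(1)
    x = rng.integers(0, 4, size=100_000)               # i.i.d. uniform over 4 symbols
    print([round(block_entropy_rate(x, T), 3) for T in (1, 2, 4)])   # all close to 2 bits per symbol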

The reason why I(X) is the true information production rate of the process X is that there exist representations of X in terms of other processes, Y, such that sequences, x, written in the symbols of the initial alphabet can be rewritten into sequences, y, that use symbols from a much larger alphabet. In other words, all structure (correlations) gets absorbed into new symbols belonging to an extended alphabet. The definition of information production, I(X), simply uses the largest possible such alphabet –the alphabet of T-tuples over the original alphabet, with T → ∞– in which every possible sequence asymptotically becomes a unique symbol, and these symbols are statistically independent.

However, typically there exist much smaller alphabets that can capture all structures of a process or system. We say Y is adjoined to X (see subsection Constructing adjoint process spaces for details) if Y's symbols are uncorrelated and, thus, H measures the correct information content. In general, a symbol, z, of the extended alphabet encodes a number, ℓ(z), of symbols of the original alphabet. For instance, if z is a symbol in the alphabet of T-tuples, then ℓ(z) = T, since z consists of exactly T letters of the original alphabet. The average length, ⟨ℓ⟩f, of original symbols emitted per symbol of the extended alphabet increases with the size of the extended alphabet; here fz = pz(Y) is the distribution function of the letters z. Consequently, the unit of information, k, adapts to the "complexity" ℓ(z) and becomes kY = k/⟨ℓ⟩f.

For an example of how one can encode information about the correlations of a process X on all relevant scales, imagine an initial alphabet of Latin letters and extend it to a series of extended alphabets: one that contains syllables in addition to letters, one that adds word fragments, one that includes words, one with frequent word combinations, one with phrases, and so on. This sequence of alphabets is nested in the sense that each alphabet contains the previous one. With any one of these alphabets, say the n-th, one can sample "text" by using the associated marginal distributions, pλ(n), of its elements, λ. With increasing n, the resulting artificial text samples will more and more resemble the English text body from which the marginal distributions pλ(n) were derived; see Text 1 in S1 File for the examples given by C. Shannon. We can find particular sequences of alphabets such that each alphabet contains exactly one more symbol than the previous one, by applying reversible substitutions of symbols, which we refer to as "parsing rules". For details, see Text 3 in S1 File.
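A minimal sketch of this sampling step (our illustration; the corpus and the choice of words as the extended alphabet are placeholders): symbols are drawn i.i.d. from the empirical marginal distribution of the chosen alphabet and concatenated.

    import numpy as np
    from collections import Counter

    def sample_from_marginals(corpus_symbols, n_out, rng):
        """Draw n_out symbols i.i.d. from the empirical marginal distribution of the given alphabet."""
        counts = Counter(corpus_symbols)
        symbols = list(counts)
        p = np.array([counts[s] for s in symbols], dtype=float)
        p /= p.sum()
        return rng.choice(symbols, size=n_out, p=p)

    rng = np.random.default_rng(0)
    corpus = "a placeholder corpus in which some words occur more often than other words do".split()
    print(" ".join(sample_from_marginals(corpus, 10, rng)))   # "text" sampled from word marginals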

The second way is to directly capture the correlations and structures of non-i.i.d. systems or processes in a generalised functional form of the entropy, which –typically– looks more complicated than Eq (1); this approach has so far been successful for a handful of processes. Generalised entropy functionals are usually expressed in terms of marginal distributions in the "original alphabet". These generalised entropies have been studied extensively for several decades [2–6] from different angles, generally for systems with strong or long-range correlations [7–9], that are non-ergodic or internally constrained [10–12], or for systems out of equilibrium [13–15].

For non-i.i.d. systems or processes it is essential to specify the context in which the term entropy is used, whether one talks about information theory, thermodynamics, or the maximum entropy principle (MEP) [16]. While thermodynamic aspects of such systems, especially the existence of well-defined thermodynamic potentials (and, as a consequence, temperature), are heavily debated, there is wide consensus that entropy production (the physical analogue of information production) remains a valid concept also for these systems. For thermodynamic considerations away from equilibrium and the i.i.d. setting, and for applications, e.g., to neural correlations, see for instance [17–19].

Here we will not focus on thermodynamic aspects of entropy, but on the original context envisioned by Boltzmann: its power to predict a particular macro state from knowing the number of micro states (multiplicity) corresponding to it. This allows one to predict typical distribution functions and to derive functional relations between macro state variables (expectation values of the respective observables). This view, which is tightly related to the MEP, is not restricted to physics and is not limited to i.i.d. processes. For specific cases the respective MEP functionals, the generalised entropy and cross entropy, have been explicitly constructed [5, 6]. In particular, if multiplicity and probabilities are multiplicatively separable in the asymptotic limit [5], a clear definition of cross entropy is possible also for non-i.i.d. systems.

While the Boltzmann entropy concept remains untouched (log multiplicity), its functional form, i.e., the generalised entropy functional, depends on the context of the process, X, and the process class, Φ, to which it belongs. Some examples of different process classes include i.i.d. processes, exchangeable and polynomial mixture processes [20], Polya processes [21], sample space reducing processes [6], and processes describing structure-forming systems [22].

The idea behind generalised entropies is to quantify entropy as the logarithm of the number of micro states (multiplicity) of the process, X. It is based on the marginal distribution function, gi = pi(X), of symbols i from the sample space (here the original alphabet Ω), which enters a functional constructed such that it captures the structural information in X. Similarly, one obtains generalised expressions for cross entropy and information divergence. Those functionals effectively capture arbitrarily complicated relations between symbols of the sample space (original alphabet) in terms of the marginal symbol frequencies, g, of the process X. Generalised entropies (that fulfil the first Shannon-Khinchin axiom) do not explicitly depend on system parameters that identify a process within a process class or on other details. Constructing such a functional may obviously be complicated in general and has been achieved convincingly only in a few cases, e.g. [6, 21]. As we will see, one can reconstruct (or at least approximate) such functionals from data—at least in principle, since there exist fundamental limits to reconstructing generative grammars from data on the basis of statistical inference alone [23, 24], a fact also captured in Chaitin's incompleteness theorem [25]. In other words, the question of whether some data, a particular sequence, x, contains regular structures that can be used to compress it may become undecidable.

Here we show that the two approaches, information production and the generalised entropy functionals, can be mapped onto one another, meaning that they are the same. The diagram in Fig 1 schematically shows the basic idea: We first use the method of parsing rules to construct an adjoint representation, Y, of a given process X, and write Y = πX. Here π is a map that reversibly encodes all structures in X, such that the process Y in its new extended alphabet is i.i.d. and is therefore fully described by its marginal distribution function, fz = pz(Y), where z is again a letter from the extended alphabet. Consequently, the Shannon information measure with the appropriate unit of information is adequate (first way). Next, we project the marginal distributions from the adjoint representation, fz, to the original alphabet and, in a last step, we identify the process-specific "pull-back" information measures –entropy, cross-entropy, and information divergence, SX and DX among them, precisely defined in Eq (14) below, which take distribution functions over the original alphabet as arguments– with the corresponding generalised information measures (second way), by adequately lifting distribution functions over the original alphabet to distributions over the adjoint alphabet.

Fig 1. Diagram of the relations between distribution functions and entropies over the sample space and the adjoint sample space.

We consider a process, X, over the original alphabet with adjoint process Y = πX over the adjoint alphabet. Y is i.i.d. and therefore fully characterized by the marginal distribution of the letters it samples from, i.e., asymptotically the data y = πx are fully characterized by the relative frequency distribution function f = p(y). Φ* is the set of all i.i.d. processes over the adjoint alphabet. Therefore Φπ = π−1Φ* is the class of processes that naturally generalizes X ∈ Φπ. f can be projected to the marginal distribution function g = p(x) = π*p(y) = π*f. Conversely, for a particular process, X, we can lift the distribution function g to the associated adjoint distribution. Since Y is i.i.d. over the adjoint sample space, one can measure information production simply by using Shannon entropy with the adequate Boltzmann factor kY (adapted to the distribution function f). The commutative diagram therefore defines the process-specific generalised entropy SX over the sample space of the process class Φπ = π−1Φ*.

https://doi.org/10.1371/journal.pone.0290695.g001

For the proof we use the minimal description length (MDL) (see also Text 2 in S1 File), the length of the shortest encoding that fully represents the data. In this context we shall see that information production is tightly related to the notion of Kolmogorov complexity [25–27]; for a brief discussion, see Text 4 in S1 File. We explicitly demonstrate the method in an example for the class of sample space reducing (SSR) processes [21], which provide simple, analytically tractable models for driven dissipative systems and typically exhibit power-law distribution functions. Their generalised entropy is exactly known for arbitrary driving rates [6, 21].

The purpose of the paper is to show that SX(X) = kYH(Y) indeed represents the generalised entropies of the second way. The proof is given constructively in the following section.

Results

The key tool used in the following are simple substitution rules, parsing rules, that allow us to reversibly re-code (possibly correlated) data streams into new symbol streams that no longer carry structure; see Text 3 in S1 File. We refer to the structure of a parsing rule as a template, and to a particular substitution rule derived from a template as a parsing rule.

The simplest parsing rule template, which we refer to as the elementary template, can be denoted by [r s → m], meaning that two symbols r and s that appear together (in this order) are substituted by a new symbol m. In the following we will associate m also with the symbol index, i.e., it is the m-th symbol in an alphabet.

To recode data one may choose a suitable set of parsing rule templates. We speak of a relevant set of parsing rule templates if (i) one can extract the full information content of a process, X, asymptotically, solely by using parsing rules from the set of templates, and (ii) if omitting one template from the set does not allow one to do so. In the following we focus on processes for which the elementary parsing rule template forms a relevant set. However, the arguments presented here extend naturally to more general sets of parsing rule templates; see Text 3 in S1 File.
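As a sketch (our Python, with symbols represented as integers), an elementary parsing rule [r s → m] and its inverse can be implemented as a reversible pair substitution:

    def apply_rule(seq, r, s, m):
        """Replace every non-overlapping occurrence of the pair (r, s) by the new symbol m (left to right)."""
        out, t = [], 0
        while t < len(seq):
            if t + 1 < len(seq) and seq[t] == r and seq[t + 1] == s:
                out.append(m)
                t += 2
            else:
                out.append(seq[t])
                t += 1
        return out

    def invert_rule(seq, r, s, m):
        """Expand the symbol m back into the pair (r, s); the substitution is reversible."""
        out = []
        for z in seq:
            out.extend([r, s] if z == m else [z])
        return out

    x = [4, 2, 1, 2, 1, 4, 1]
    y = apply_rule(x, 2, 1, 5)            # [4, 5, 5, 4, 1]
    assert invert_rule(y, 2, 1, 5) == x   # reversible re-coding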

A generalization of the framework presented here from categorical to continuous random variables can be obtained by first coarse-graining the continuous dynamics and then taking the limit of ever finer coarse-graining, however at considerable measure-theoretic cost. Genuinely reformulating the theory in a path-integral framework will mainly be limited by the need to translate the concepts of alphabet and parsing rules, which are inherently discrete objects, into a framework of continuous variables.

Constructing adjoint process spaces

Decidability issues may forever limit our ability to design algorithms, not specific to a given context, for reconstructing optimal adjoint representations from data. On the other hand, for any finite data over finite sample spaces and a finite relevant set of parsing rule templates one can in principle find the optimal proxy of an adjoint representation by exhaustive search, even though time constraints forbid such exhaustive searches in practical implementations. The general procedure underlying the construction of adjoint representations always remains the same.

Suppose X is a process that emits symbols i drawn from the alphabet Ω = {1, …, W}, where W is the number of symbols. X generates data streams, x(t) = xt xt−1 ⋯ x1, where every xt is one of the W available symbols in Ω. Consider now two letters r1 and s1 such that the pair r1s1 = xτxτ−1, for some positions τ in the data x, has been identified to contain relevant information (e.g. because the pair is over-expressed). Then we can rewrite the pair r1s1 as a new letter m1 = W + 1, which becomes the first letter extending the alphabet Ω, with the parsing rule π1 = [r1 s1 → W + 1].

We can iterate this and produce parsing rules πn = [rn sn → W + n], with letter indices rn < W + n and sn < W + n, where πn maps data over the alphabet containing W + n − 1 symbols to data over the alphabet containing W + n symbols. Note that the parsing rules πn can be uniquely inverted, i.e., we can expand data over the larger alphabet back to data over the smaller one using the inverse of πn. In other words, the inverse parsing rules can be thought of as being part of a "generative grammar" [23]. We therefore can construct a sequence of maps π(n) = πn πn−1 ⋯ π1 such that data x can be mapped to representations yn = π(n)x. At every parsing level, n, we get a corresponding distribution function of the re-coded data, pz(yn), with letter index z = 1, 2, ⋯, W + n.
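Continuing the sketch from above (our code; apply_rule and invert_rule are the helpers defined there), the map π(n) is the composition of elementary rules applied in order, and its inverse expands them in reverse order:

    def parse(seq, rules):
        """Apply parsing rules pi_1, ..., pi_n in order; each rule is a triple (r, s, m)."""
        for r, s, m in rules:
            seq = apply_rule(seq, r, s, m)
        return seq

    def expand(seq, rules):
        """Invert pi^(n) by expanding the rules in reverse order."""
        for r, s, m in reversed(rules):
            seq = invert_rule(seq, r, s, m)
        return seq

    rules = [(2, 1, 5), (3, 5, 7)]        # pi_1 = [2 1 -> 5], pi_2 = [3 5 -> 7]
    x = [3, 2, 1, 4, 2, 1, 2, 1]
    y = parse(x, rules)                   # [7, 4, 5, 5]
    assert expand(y, rules) == x          # the re-coding is reversible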

The Kraft and McMillan theorems [28, 29] tell us that if all we know about a process are the marginal relative frequencies, gi, at which its symbols i occur, there exists a shortest reversible encoding of the data, x, whose characteristic length, the minimal description length (MDL), gives the theoretically achievable minimal length of x (in units of bits). It is a lower bound for the true MDL, L(x), that can only be attained asymptotically. The theorems state that

t k H(g) ≤ L(x) ,  (3)

with k = 1/log(b), where b is the basis in which information is measured; for bits we typically have b = 2. kH(g) is the MDL in bits per symbol and t k H(g) is the minimal number of bits required to encode messages of length t.
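A sketch of the bound in Eq (3) (our reading of the statement above): the empirical marginal frequencies g yield a per-symbol lower bound kH(g), so t·kH(g) bounds the length, in bits, of any reversible symbol-by-symbol encoding of x.

    import numpy as np
    from collections import Counter

    def mdl_lower_bound_bits(x):
        """t * k * H(g) with k = 1/log(2): lower bound, in bits, on a reversible encoding of x
        that uses only the marginal relative frequencies g of the symbols occurring in x."""
        counts = np.array(list(Counter(x).values()), dtype=float)
        g = counts / counts.sum()
        return len(x) * (-np.sum(g * np.log2(g)))

    x = [1, 1, 1, 2, 1, 3, 1, 2]
    print(mdl_lower_bound_bits(x))   # about 10.4 bits, i.e. roughly 1.3 bits per symbol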

For data x(t) of length t we find a sequence of representations yn(t) = π(n)x(t). Suppose now that for every t we find a parsing level n*(t) such that yn*(t)(t) is a representation of the data x(t) that cannot be distinguished from an i.i.d. process (which we indicate here by a *). It follows that yn*(t)(t) is entirely determined by its marginal distribution of letters, pz(yn*(t)(t)), which asymptotically determines the MDL of the data (≃ means asymptotically identical). With |.| we indicate the length of a sequence in numbers of letters of the underlying alphabet; for instance, |x(t)| = t and |yn+1(t)| ≤ |yn(t)|. Then, we can asymptotically measure the information production of the process X to be

I(X) ≃ limt→∞ (k/t) |yn*(t)(t)| H(p(yn*(t)(t))) .  (4)

As discussed above, the "complexity", ℓ(z), of a symbol z is the number of letters of the original alphabet it codes for. As a consequence we can compute the average symbol complexity of the data yn*(t)(t) as ⟨ℓ⟩ = t/|yn*(t)(t)|, and for the adjoint process Y = limt→∞ Yn*(t) we get ⟨ℓ⟩f = limt→∞ t/|yn*(t)(t)|. The adequate unit of information, kY, for measuring information production, kYH(p(Y)), therefore is given by

kY = k/⟨ℓ⟩f .  (5)
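A sketch of Eq (5) in code (our notation; ell holds the number ℓ(z) of original letters each adjoint symbol codes for):

    import numpy as np

    def information_production(f, ell, k=1 / np.log(2)):
        """k_Y * H(f) with k_Y = k / <ell>_f, cf. Eq (5): bits per emitted symbol of the original alphabet."""
        f, ell = np.asarray(f, dtype=float), np.asarray(ell, dtype=float)
        k_Y = k / np.sum(f * ell)          # <ell>_f = average number of original letters per adjoint symbol
        return k_Y * (-np.sum(f[f > 0] * np.log(f[f > 0])))

    # hypothetical adjoint alphabet: two symbols of length 1 and one symbol coding for 3 original letters
    print(information_production(f=[0.5, 0.3, 0.2], ell=[1, 1, 3]))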

For simplicity, suppose there is a maximal n* that holds for all t. Since we assumed that yn* is already indistinguishable from an i.i.d. process, applying another parsing rule would only compress the data without changing its MDL: the estimated description length, |yn| k H(pz(yn)), remains the same for n > n*. If, conversely, we look at a parsing level, n, where the adjoint process is not yet i.i.d., then we can find a parsing rule, πn+1, such that |yn+1| k H(pz(yn+1)) < |yn| k H(pz(yn)). Additional knowledge always reduces the attainable information production rate.

In principle, for any finite amount of data x(t) one can construct the optimal map, π(n), for the process, X, by minimizing over all possible sequences of parsing rules at any fixed parsing level n, provided we know the relevant set of parsing rule templates to consider and this set is finite. Then we can, in principle, find n = n*(t) such that no further reduction of the minimal description length is possible by applying any more parsing rules. In practice, an extensive search over all possible sequences of parsing rules is of course not feasible, even if the set of parsing rule templates only consists of the elementary template, and algorithms for inferring adjoint representations of data need to turn to different means of optimization. For theoretical considerations we may, however, assume that for a given finite relevant set of parsing rule templates and any finite t we can find the optimal map π (or one of several possible optimal maps if it is not unique), or at least a map reasonably close to optimal, since the number of possible maps we would have to evaluate also remains finite. Intuitively it is clear, however, that for an unknown process X for which we cannot predetermine the respective relevant set of parsing rule templates, one typically can no longer decide whether a map π is optimal or not.
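Since exhaustive search is infeasible, a common practical stand-in (our sketch, in the spirit of byte-pair encoding, not the authors' algorithm) is a greedy heuristic: keep adding the elementary rule for the currently most frequent pair as long as the estimated description length per original symbol decreases. It reuses apply_rule from the earlier sketch.

    import numpy as np
    from collections import Counter

    def bits_per_original_symbol(seq, length_of):
        """k_Y * H(f) estimated from the current representation seq, cf. Eq (5)."""
        counts = Counter(seq)
        f = np.array([c / len(seq) for c in counts.values()])
        mean_ell = sum(length_of[z] * c for z, c in counts.items()) / len(seq)   # <ell>_f
        return -np.sum(f * np.log2(f)) / mean_ell

    def greedy_parse(x, W, max_rules=100):
        """Greedily add elementary rules [r s -> m] for the most frequent pair; stop once the
        description length per original symbol no longer decreases (a heuristic proxy for n*)."""
        seq = list(x)
        length_of = {i: 1 for i in range(1, W + 1)}      # ell(z) = 1 for the original alphabet
        rules = []
        best = bits_per_original_symbol(seq, length_of)
        for m in range(W + 1, W + 1 + max_rules):
            pairs = Counter(zip(seq, seq[1:]))
            if not pairs:
                break
            (r, s), _ = pairs.most_common(1)[0]
            length_of[m] = length_of[r] + length_of[s]
            candidate = apply_rule(seq, r, s, m)         # helper from the elementary-template sketch
            score = bits_per_original_symbol(candidate, length_of)
            if score >= best:
                break
            seq, best = candidate, score
            rules.append((r, s, m))
        return rules, seq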

However, given an optimal π, the adjoint i.i.d. process, Y, is fully characterized by its marginal distribution, f = p(Y), over the symbols of the adjoint alphabet, and the information production, I(X) = kYH(f), is given by the Shannon entropy of Y. On the adjoint process space Φ* we can use the measures of Shannon entropy, cross-entropy, and Kullback-Leibler information divergence, given that we use the appropriate unit of information, kY of Eq (5). Since Y is i.i.d. over the adjoint alphabet, it naturally belongs to the family, Φ*, of all i.i.d. processes over this alphabet. Further, any Y′ in Φ* is fully characterized by its marginal distribution, f′, and the pair (f′, π) determines the process X′ = π−1Y′. Hence, the process class, Φπ, that naturally generalizes a process, X, with adjoint i.i.d. process, Y = πX, is given by Φπ = π−1Φ*.

This construction completes the first part of the proof: it establishes that we can essentially measure the information production of a process as the Shannon entropy of its adjoint process. This entropy, however, uses the marginal distributions over the adjoint alphabet as arguments and can therefore not be identified directly with the generalised entropies that use the marginal distributions over the original alphabet as arguments. In the next step we will pull the information measures over the adjoint message space back to the original message space.

Information measures over extended alphabets

Suppose we have a process X with an adjoint process Y = πX. For data, x, and its adjoint sequence, y = yn, we obtain two histograms: hi(x) of the symbols of the original alphabet and hz(y) of the symbols of the adjoint alphabet. The associated relative frequency distributions are given by g = p(x) = h(x)/|x| and f = p(y) = h(y)/|y|. Further, every symbol, z, represents a number, ℓ(z), of symbols, π−1z, of the original alphabet, with π = π(n). For every adjoint symbol z we can also record the histogram counting how often each original letter i occurs in π−1z; for z ≤ W this histogram reduces to the single letter z itself, and its entries always sum to ℓ(z). This provides us with constraints, (6), that we need in the next section. As a consequence, we have (7)

We drop the arguments (x) and (y) (or (yn)) from now on and distinguish histograms by their index i (over the original alphabet) and z (over the adjoint alphabet). Let ⟨·⟩f denote expectation values under the distribution f (in this notation ⟨·⟩f ≡ ⟨·⟩Y). By construction, the histogram over the original alphabet, the length |x|, and the marginal distribution g are then obtained from f by averaging, over the adjoint symbols z, the per-symbol letter counts and the lengths ℓ(z). This means that we can write the constraints that link distributions g over the original alphabet, Eq (6), with distributions f over the adjoint alphabet in the following way (8)
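A sketch of the projection π* implied by these matching constraints (our notation; decode[z] is the list π−1z of original letters that z codes for, and ni(z) counts how often letter i occurs in it): gi is obtained as ⟨ni⟩f/⟨ℓ⟩f.

    from collections import Counter

    def project(f, decode):
        """Project an adjoint marginal distribution f (dict z -> f_z) to the marginal distribution g
        over the original alphabet: g_i = <n_i>_f / <ell>_f, with decode[z] the letters z codes for."""
        mean_ell = sum(fz * len(decode[z]) for z, fz in f.items())   # <ell>_f
        g = Counter()
        for z, fz in f.items():
            for i in decode[z]:
                g[i] += fz / mean_ell
        return dict(g)

    decode = {1: [1], 2: [2], 5: [2, 1]}                   # hypothetical small adjoint alphabet
    print(project({1: 0.2, 2: 0.3, 5: 0.5}, decode))       # marginal distribution g over letters 1 and 2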

As mentioned before, the process class, Φπ, that X belongs to is completely determined by the map π, and the process X by the pair (f, π); see Fig 1. Therefore, we can identify the entropy of X with (9), with the process-specific Boltzmann factor kY ≡ k/⟨ℓ⟩f. For processes, X and X′, and with f′ = p(πX′), the cross-entropy and the information divergence are (10)

In the special case where X is already an i.i.d. process, no features can be extracted from the data and n = 0, π = π(0) = id, and ℓ(z) = 1 for all z = 1⋯W. Consequently, Sπ = kH, the cross-entropy reduces to the usual cross-entropy, and the divergence to the Kullback-Leibler divergence, as required.

Pulling back entropies to the original alphabet

In the next step one can construct entropy functionals over the original alphabet by lifting a distribution function, g′, over the original alphabet to a distribution function, f′, over the adjoint alphabet, assuming that f = p(πX) is the true distribution function of the process Y = πX. We proceed by minimizing the information divergence, Dπ(f′||f), with respect to f′. More precisely, we minimize the functional ψ(f′, α, η) given by (11), with Lagrange multipliers, α and ηi, that normalize f′ and guarantee the constraints from Eq (8). Solving the variational principle δψ = 0 estimates the f′ at the minimum that is compatible with g′. We identify this minimizer as the lifted distribution and obtain α = 1/⟨ℓ⟩f and (12), which has to be solved self-consistently. If f already meets all matching constraints with g′, i.e., if g′ = g with g = p(X), then the minimizer is f itself.
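A numerical sketch of this lift (ours; a generic constrained minimisation rather than the closed form of Eq (12)): minimise D(f′||f) over f′ subject to normalisation and the matching constraints of Eq (8) for a target marginal g′. The decode table and distributions below are hypothetical.

    import numpy as np
    from scipy.optimize import minimize

    def lift(g_target, f, decode, symbols):
        """Find f' minimising D(f'||f) subject to sum_z f'_z = 1 and, for every original letter i,
        sum_z f'_z * (n_i(z) - g'_i * ell(z)) = 0, i.e. the matching constraints of Eq (8)."""
        f0 = np.array([f[z] for z in symbols], dtype=float)
        ell = np.array([len(decode[z]) for z in symbols], dtype=float)
        letters = sorted(g_target)
        n = np.array([[decode[z].count(i) for z in symbols] for i in letters], dtype=float)

        def kl(fp):                                   # information divergence D(f'||f)
            return float(np.sum(fp * np.log(fp / f0)))

        cons = [{"type": "eq", "fun": lambda fp: np.sum(fp) - 1.0}]
        for row, i in zip(n, letters):
            cons.append({"type": "eq",
                         "fun": lambda fp, row=row, gi=g_target[i]: float(fp @ (row - gi * ell))})
        res = minimize(kl, x0=f0, bounds=[(1e-9, 1.0)] * len(symbols),
                       constraints=cons, method="SLSQP")
        return dict(zip(symbols, res.x))

    decode = {1: [1], 2: [2], 5: [2, 1]}                       # hypothetical adjoint alphabet
    print(lift({1: 0.5, 2: 0.5}, {1: 0.2, 2: 0.3, 5: 0.5}, decode, symbols=[1, 2, 5]))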

This means that one can lift marginal distributions, g′, over the original alphabet to distributions over the adjoint alphabet with respect to a particular process, X. As a consequence one can pull the entropy, cross-entropy, and divergence back from distributions, f, over the adjoint sample space to distributions, g, over the initial alphabet, with respect to a particular process, X ∈ Φπ. In particular, one can define the projection operator, π*, through g = p(π−1y) ≡ π*p(y) = π*f, and the operator that lifts distributions, g, over the initial alphabet to distributions, f, over the extended alphabet through (13), with respect to the process X, i.e., with respect to the distribution function, p(πX), of the i.i.d. process, πX, over the adjoint alphabet. The lift operator gives us the minimizer of Eq (12), which allows us to identify (14), where, typically, g = p(X). We call these measures the pull-back entropy, cross-entropy, and information divergence of the process X. Note that while Sπ, the associated cross-entropy, and Dπ are universal on the entire class of processes Φπ, pulling the information measures back to marginal distributions g over the original alphabet yields information measures that are specific to a particular process X ∈ Φπ.

Generalised entropies over initial alphabets

The final question is how the pull-back measures SX, the associated cross-entropy, and DX, defined in Eq (14), are related to generalised entropy functionals as derived, for example, in [6, 21]. There, functionals were derived to obtain the most likely histogram, h, observed in a given process after t observations (for large t). Even for non-i.i.d. processes, the probability, P(h|θ), to observe a particular histogram, h, for t = ∑i∈Ω hi observations often factorizes, P(h|θ) = M(h)G(h|θ), into a multiplicity, M(h), and a probability term of the sequences, G(h|θ). θ is a set of parameters that determines the process, X—it defines and parametrizes the process class X belongs to. Whenever such a factorization is possible, one can show that a generalised maximum entropy principle exists. Using the Boltzmann definition of entropy, the logarithm of multiplicity, (1/t) log M(h), defining the generalised cross entropy as −(1/t) log G(h|θ), and a generalised information divergence as D(g|θ) = −(1/t) log P(h|θ), where g = h/t, the standard relations remain valid [5].

If we are looking at a family of processes X(θ) rather than a single process X, then we can no longer assume a priori that the map π(1) that efficiently takes some process X(θ(1)) to an adjoint representation is the same map π(2) that efficiently takes another process X(θ(2)) of the family to an adjoint representation. However, there are ways in which we can think of constructing a single map π to some extended alphabet such that all processes X(θ) decorrelate under the same map π. In the weakest version we can start with some extended alphabet, the space of all i.i.d. processes Φ* over that alphabet, and a map π, and consider the process family Φπ = π−1Φ*, in which case the natural parameters of the processes X(θ) are given by θ = f, where f are distribution functions over the extended alphabet. For a discussion of stronger versions see Text 6 in S1 File.

However, if the above assumption is valid, then X(θ) forms a sub-class of processes in Φπ with distribution functions, g(θ) = p(X(θ)), and adjoint distribution functions, f(θ) = p(πX(θ)). Similarly we have y′ = πx′ and the corresponding histograms over the adjoint alphabet. In the next step we note that the probability of observing a histogram h′ under the process X(θ) is given by (15), with h* being the respective maximizing argument, the histogram over the extended alphabet, subject to the constraints between the histograms over the original and the adjoint alphabet. Since the y are sampled i.i.d., in the last line (16) is the multinomial distribution of histograms h*. Also note that in the second-to-last line of Eq (15) ≃ stands for asymptotically equivalent, meaning that the relative error we make in log P(h′|θ) by replacing P(h′|θ) with P(h*|f(θ)) vanishes in probability as t → ∞. For more details see Text 7 in S1 File. We note that this asymptotic equivalence is a consequence of the fact that processes that can be mapped into an adjoint representation become i.i.d. there and can therefore implicitly be thought of as being ergodic on sufficiently large time scales.

Taking logs, multiplying both sides with −k/t, and setting g′ = h′/t and f* = h*/|h*|, we then obtain (17), where ≃ again means asymptotically identical for large t; from the second to the third line we used the definition of the lift operator from Eq (13) and DX from Eq (14). In other words, we have shown that in the limit t → ∞ the generalised information divergence D(g′|θ) is identical, as a functional, to the pull-back divergence DX(θ)(g′|g(θ)). As a consequence we can identify SX with the generalised entropy and the pull-back cross-entropy with the generalised cross entropy. We see that the generalised entropy for the process family, X(θ), is given by the corresponding pull-back entropy.

We conclude that for all process classes (at least those that decorrelate over a common map π) there exist notions of entropy, SX, cross-entropy, and divergence, DX, that behave in the usual way, namely divergence = cross-entropy − entropy. This means that for such processes (at least asymptotically) the probability to observe a particular histogram, P ≃ exp(−tDX), factorizes into a multiplicity term, M ≃ exp(tSX), associated with entropy, and a sequence probability term associated with cross-entropy. In other words, we have shown that Boltzmann's entropy, the logarithm of multiplicity, remains the correct estimator of information production. For complex systems the multiplicity will differ from a multinomial coefficient in arbitrarily complicated ways and might even depend explicitly on system parameters θ, which would mean a violation of the first Shannon-Khinchin axiom (SK1). In other words, the Boltzmann pull-back entropy functional typically will be more complicated than the Shannon entropy functional and can even violate SK1—yet it is the appropriate generalization of entropy in terms of information production. We also learned that beyond such parametric families of generalised entropies, i.e., beyond the pull-back measures, we find the standard notions of entropy, cross-entropy, and divergence in the adjoint alphabet, where the only thing that is not universal about the information measures is the process-specific Boltzmann constant, kY, that needs to be used.

Example: SSR processes

We now demonstrate explicitly that the generalised entropy that –according to the previous section– is identified with the pull-back entropy, SX, indeed measures information production. We do that by considering slowly driven sample space reducing (SSR) processes, X, for which the generalised entropy functional is exactly known [21]. SSR processes are models of driven non-equilibrium systems. They are characterized by the fact that, as the process unfolds, the number of states accessible to the process reduces whenever no driving is present [30]. In its simplest form, the process relaxes to a ground state from which it has to be restarted. One can think of the process as a ball bouncing down a staircase with random jump sizes. The ball can only jump to steps lower than the last step it visited. Once it reaches the bottom of the staircase one lifts the ball to the top of the staircase (driving) and kicks it down the staircase again. The stairs represent (energy) states, the lowest being 1, the highest W. The process exhibits path dependence in the relaxation part; the current through the system breaks detailed balance between states. Such processes exhibit a Zipf law in their distribution function.

The micro states, x, of the SSR process are sequences of states with elements xn ∈ {1, 2, ⋯, W} ≡ Ω. The transition probabilities, p(j|i), between states are given by (18), where the first term on the right hand side describes the relaxing part of the SSR process (transitions only happen from a higher state i to lower states, j, i.e., when j < i) with prior distribution qi and cumulative distribution Qi = ∑j≤i qj; Θ is the Heaviside function. The second term captures the (slow) driving of the process. Slow here means that the system is only driven once the SSR process reaches its lowest position i = 1. SSR processes are Markovian, since the transition probabilities depend only on the current position, and ergodic, since after the relaxation the system is reset to any state i with probability qi.

To understand the statistics of the process we are interested in the distribution of visits to the individual states. We define the macro state to be the histogram, hi, of visits of X to state i. It is possible to compute the Boltzmann entropy, (1/t) log M(h), where the multiplicity, M(h), is the number of different sequences, x, of length t with the same histogram h. One finds [21] (19), where pi = hi/t are the relative frequencies of observing state i. Note that this is the Boltzmann entropy of the system, yet it is not of Shannon form, since it is derived from a Markov process and not from an independent sampling process. Similarly, one finds the cross-entropy (20), and by maximizing the negative information divergence, −D, one obtains the characteristic Zipf distribution of the slowly driven SSR process [30],

pi ∝ qi/Qi .  (21)

For the special case of qi = 1/W for all i, the Zipf distribution, pi ∝ 1/i, is obvious, since Qi = i/W and qi/Qi = 1/i. It continues to hold for "well-behaved" qi [31]. For instance, if qi ∝ i^α for α > −1, then Qi ∝ i^(1+α) and again qi/Qi ∝ 1/i.
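A simulation sketch (ours, with the dynamics read off from Eq (18) as described above) that checks Eq (21) numerically: the empirical visit distribution of a slowly driven SSR process with uniform prior q closely follows Zipf's law.

    import numpy as np
    from collections import Counter

    def simulate_ssr(q, t_max, rng):
        """Slowly driven SSR process: from state i > 1 relax to j < i with prob q_j / Q_{i-1};
        from the ground state i = 1 the process is driven, i.e. reset to any j with prob q_j."""
        q = np.asarray(q, dtype=float) / np.sum(q)
        Q = np.cumsum(q)                                   # Q_i = q_1 + ... + q_i
        W = len(q)
        x, i = [], W                                       # start at the top of the staircase
        for _ in range(t_max):
            x.append(i)
            if i == 1:                                     # ground state: driving / restart
                i = int(rng.choice(np.arange(1, W + 1), p=q))
            else:                                          # relaxation: sample space reduced to {1, ..., i-1}
                i = int(rng.choice(np.arange(1, i), p=q[:i - 1] / Q[i - 2]))
        return x

    rng = np.random.default_rng(0)
    W = 8
    x = simulate_ssr(np.ones(W) / W, 200_000, rng)
    counts = Counter(x)
    p_emp = np.array([counts[i] for i in range(1, W + 1)], dtype=float)
    p_emp /= p_emp.sum()
    zipf = (1.0 / np.arange(1, W + 1)) / np.sum(1.0 / np.arange(1, W + 1))
    print(np.round(p_emp, 3))    # empirical visit distribution
    print(np.round(zipf, 3))     # Zipf prediction q_i / Q_i (normalized), Eq (21)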

In the next step we use a simple example of a slowly driven SSR process to demonstrate how the use of extended alphabets works and why the respective generalised entropy functional is the adequate measure of information production.

Example of a small SSR system.

To demonstrate what a minimal adjoint alphabet for a slowly driven SSR process looks like, consider such a process over an initial alphabet of W = 4 symbols (numbers) representing the four states, Ω = {1, 2, 3, 4}. An SSR sequence in that alphabet might look like x = 421214321431212141⋯. Remember that the qi are normalized weights such that the probability to sample the state j < i, conditional on the process being in state i, is given by qj/Qi−1, with Qi = ∑j≤i qj, and by qj if the system is in the ground state i = 1 and gets driven. One can think of the adjoint SSR alphabet as the union of Ω with the set of new symbols that represent all possible strictly monotonically decreasing sequences over Ω that end in the ground state 1, where the new symbol "5" represents the sequence 21, "6" stands for 31, "7" for 321, "8" for 41, "9" for 421, "10" for 431, and "11" for 4321. Since we have 7 new symbols, n* = 7, extending the original alphabet {1, 2, 3, 4}, we have a total of 11 symbols in the extended alphabet. The 7 parsing rules producing the new symbols are given by (22), and the map π = π7π6π5π4π3π2π1, which maps between messages written in the initial and the adjoint alphabet, can be constructed. We therefore can rewrite our example x = 421214321431212141⋯ into π(1)x = 4554354315541⋯, then π(2)x = 455435465541⋯, π(3)x = 45547465541⋯, π(4)x = 4554746558⋯, π(5)x = 954746558⋯, π(6)x = 9547[10]558⋯, and finally π(7)x = 95[11][10]558⋯. We now project a distribution function, f, over the adjoint alphabet onto a distribution function, g, over Ω. We remember that the letters 5 to 11 code for the following subsequences: π−15 = 21, π−16 = 31, π−17 = 321, π−18 = 41, π−19 = 421, π−110 = 431, and π−111 = 4321, and see that all the new letters with index 5 to 11 represent sequence chunks that contain a 1; letter 2 is part of the sequence chunks represented by the extended letters 5, 7, 9, and 11; and similarly for i = 3 and i = 4. As a consequence the distribution function gi of the message x in the original letters i = 1, ⋯, 4 and the distribution function fz of the adjoint message π(7)x in the extended letters z = 1, 2, ⋯, 11 are related by four equations, (23), where Z is a normalization constant such that g is normalized. Note that applying π to an SSR process yields (asymptotically) f2 = f3 = f4 = 0. We can now express the asymptotic relative frequencies, i.e., the probabilities, fz, in terms of the weights qi on the SSR states i = 1, 2, 3, 4, and get f5 = q2, f6 = q3q1/(q1 + q2), f7 = q3q2/(q1 + q2), f8 = q4q1/(q1 + q2 + q3), and so forth. Inserting the expressions for fz into Eq (23), one self-consistently obtains the marginal distribution over the original alphabet, (24), as predicted from Eq (21); note that if qi = 1/W is uniform, then the solution qi/Qi = 1/i exactly reproduces Zipf's law, and for a broad variety of choices of qi one obtains approximate Zipf laws. That means that f fulfils the matching constraints of Eq (8) exactly, and therefore the lift of the asymptotic marginal distribution function, g, back to the adjoint alphabet is also exact and recovers f. That is, we can see in this simple example how the distribution function g over the original alphabet can be predicted from knowing the distribution function f of the letters of the extended alphabet.
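The seven parsing rules of Eq (22) can be written in the pair notation used above as π1 = [2 1 → 5], π2 = [3 1 → 6], π3 = [3 5 → 7], π4 = [4 1 → 8], π5 = [4 5 → 9], π6 = [4 6 → 10], π7 = [4 7 → 11] (our reading, reconstructed from the decodings π−15 = 21, …, π−111 = 4321 listed above). The following sketch, reusing the parse/expand helpers from the earlier sketches, reproduces the worked example:

    rules = [(2, 1, 5), (3, 1, 6), (3, 5, 7), (4, 1, 8), (4, 5, 9), (4, 6, 10), (4, 7, 11)]

    x = [4, 2, 1, 2, 1, 4, 3, 2, 1, 4, 3, 1, 2, 1, 2, 1, 4, 1]   # x = 421214321431212141
    y = parse(x, rules)
    print(y)                        # [9, 5, 11, 10, 5, 5, 8], i.e. 95[11][10]558
    assert expand(y, rules) == x    # the adjoint re-coding is reversible

    # decode table pi^{-1}(z), as used in the projection and lift sketches above
    decode = {z: [z] for z in (1, 2, 3, 4)}
    for r, s, m in rules:
        decode[m] = decode[r] + decode[s]
    print(decode[11])               # [4, 3, 2, 1]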

Since slowly driven SSR processes are also Markov processes, i.e., processes where the probability to sample the state at time t + 1 only depends on the state of the system at the previous time step t (compare the transition probabilities of Eq (18)), one can also prove that the respective generalised entropies are indeed the adequate information measures in this context. It is well known that the information production of a Markov process can be measured by the so-called conditional entropy. This is a functional that depends on the probabilities p(2) = pij that a symbol j follows symbol i in the process. The SSR entropy, on the other hand, depends only on the marginal distribution of the symbols, p(1) = pi. If p(2) is the maximizer of the conditional entropy, or more precisely the minimizer of the conditional information divergence, and p(1) is the minimizer of the SSR information divergence, then both estimators of entropy, the conditional entropy and the SSR entropy, are identical for all choices of the system parameters q. For details of the computation, see Text 5 in S1 File.
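A numerical check in the same spirit (our sketch; the plug-in conditional entropy is a standard estimator, and simulate_ssr, parse, and rules come from the earlier sketches): the conditional entropy rate estimated from an SSR sequence and the Shannon entropy of its adjoint representation, rescaled by kY, nearly coincide.

    import numpy as np
    from collections import Counter

    def conditional_entropy_rate(x):
        """Plug-in estimate of H(X_{t+1} | X_t) in bits, from the transition counts of the sequence x."""
        pair_counts = Counter(zip(x, x[1:]))
        state_counts = Counter(x[:-1])
        H = 0.0
        for (i, j), c in pair_counts.items():
            H -= (c / (len(x) - 1)) * np.log2(c / state_counts[i])
        return H

    rng = np.random.default_rng(2)
    x = simulate_ssr(np.ones(4) / 4, 200_000, rng)          # W = 4, uniform prior q

    y = parse(x, rules)                                     # adjoint representation of the data
    f = np.array(list(Counter(y).values()), dtype=float) / len(y)
    mean_ell = len(x) / len(y)                              # <ell>_f, average letters per adjoint symbol
    adjoint_rate = -np.sum(f * np.log2(f)) / mean_ell       # k_Y H(f) in bits per original symbol

    print(round(conditional_entropy_rate(x), 4), round(adjoint_rate, 4))   # the two estimates nearly coincide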

Discussion

We showed that by identifying the entropy of a process with its information production it is possible to consistently extend the fundamental notion of entropy in statistical physics –Boltzmann entropy– to non-i.i.d. processes and to processes that operate out of equilibrium. This is done by identifying isomorphisms that map entire process classes to adjoint representations where the processes are i.i.d. The sample space (or alphabet) of the adjoint process is typically much larger than the sample space of the original process. The isomorphisms can be thought of as concatenations of parsing rules that map strongly correlated segments of the original process to new symbols. The information production of the adjoint i.i.d. process is quantified by Shannon entropy. Pulling the entropy measure in the adjoint space back to the original sample space and comparing the resulting functional with the Boltzmann entropy (process-specific log multiplicity) establishes the asymptotic equivalence of the notion of generalised entropy and information production.

This provides a comprehensive picture that consistently links information theory and the statistical physics of categorical non-i.i.d. processes in a context-sensitive way, and allows us to consistently associate a notion of entropy, a cross-entropy (representing the constraints of the maximum entropy principle), and an information divergence (or relative entropy) with complex processes. Context-sensitive means that the functional form of the entropy depends on the class of processes considered; the concept of entropy itself –information production from the information theoretic perspective and Boltzmann entropy from the physics perspective– remains untouched.

If an adjoint representation of one process is found, one can find the adjoint representations of an entire class of processes that all de-correlate in their representations over the same adjoint sample space. This means that there exists a natural way in which processes implicitly define their own generalization to an entire process class. This is possible because the property of de-correlating over the same adjoint sample space implements an equivalence relation. This has important consequences, since these equivalence classes of processes generalize the idea of an ensemble to non-i.i.d. processes, which provides a concise way –grounded in first principles– to extend the program of statistical physics to complex processes and their macro variables.

Supporting information

S1 File.

The supporting information supplies seven texts. SI Text 1, Shannon's example of random texts from different alphabets: Shannon's examples from his seminal paper on information theory are given in order to provide an intuitive demonstration of the effects of extending alphabets, e.g., from letters to words. SI Text 2, Minimal description length, i.i.d. processes, and compression: a brief discussion of how compression works for i.i.d. processes. SI Text 3, Generative grammars, parsing rules and parsing rule templates: a brief discussion of parsing rules and their role in generative grammars; we also discuss more broadly how we distinguish a parsing rule template from a particular parsing rule. SI Text 4, Information production and Kolmogorov complexity: some remarks on how information production relates to Kolmogorov complexity. SI Text 5, Detailed algebraic steps for Eq (29): the algebraic steps leading up to Eq (29) are given in detail. SI Text 6, About conjugate representations of process families: a brief discussion of issues concerning the existence of adjoint representations of entire process families rather than of a single process. SI Text 7, Measure concentration, typicality and asymptotic equivalence: a more detailed discussion of what asymptotic equivalence means in Eq (15), explaining in which sense the generalised information measures based on Boltzmann entropy are equivalent to the pull-back information measures derived in this paper.

https://doi.org/10.1371/journal.pone.0290695.s001

(PDF)

References

  1. Shannon CE. 1948 A Mathematical Theory of Communication. Bell Syst. Tech. J. 27 379–423 & 623–656.
  2. Kaniadakis G, Lissia M, and Scarfone AM. 2005 Two-parameter deformations of logarithm, exponential, and entropy: A consistent framework for generalized statistical mechanics. Phys. Rev. E 71 046128. pmid:15903747
  3. Tsallis C, Gell-Mann M, and Sato Y. 2005 Asymptotically scale-invariant occupancy of phase space makes the entropy Sq extensive. Proc. Nat. Acad. Sci. USA 102 15377–15382. pmid:16230624
  4. Naudts J. 1974 A generalised entropy function. Comm. Math. Phys. 37 175–182.
  5. Hanel R, Thurner S, and Gell-Mann M. 2014 How multiplicity of random processes determines entropy: derivation of the maximum entropy principle for complex systems. Proc. Nat. Acad. Sci. USA 111 6905–6910. pmid:24782541
  6. Hanel R and Thurner S. 2018 Maximum configuration principle for driven systems with arbitrary driving. Entropy 20 838. pmid:33266562
  7. Plastino AR and Plastino A. 1993 Stellar polytropes and Tsallis' entropy. Phys. Letters A 174 384–386.
  8. Wald R. 1993 Black hole entropy is the Noether charge. Phys. Rev. D 48 R3427(R). pmid:10016675
  9. Xu W-S and Freed KF. 2015 Generalized Entropy Theory of Glass Formation in Polymer Melts with Specific Interactions. Macromolecules 48 2333–2343.
  10. Tsallis C, Mendes RS, and Plastino AR. 1998 The role of constraints within generalized nonextensive statistics. Physica A 261 534–554.
  11. Chavanis P-H. 2006 Coarse-grained distributions and superstatistics. Physica A 359 177–212.
  12. Byrnes CI, Georgiou TT, and Lindquist A. 2001 A Generalized Entropy Criterion for Nevanlinna-Pick Interpolation with Degree Constraint. IEEE Trans. Automatic Control 46 822–839.
  13. Dudowicz J, Freed KF, and Douglas JF. 2014 Generalized entropy theory of glass formation. J. Chem. Phys. 141 234903.
  14. Goldstein S and Lebowitz JL. 2004 On the (Boltzmann) entropy of non-equilibrium systems. Physica D 193 53–66.
  15. Lieb E and Yngvason J. 2013 The entropy concept for non-equilibrium states. Proc. R. Soc. A 469 20130408. pmid:24101892
  16. Thurner S, Corominas-Murtra B, and Hanel R. 2017 The three faces of entropy for complex systems: information, thermodynamics and the maxent principle. Phys. Rev. E 96 032124.
  17. Gao X, Gallicchio E, and Roitberg AE. 2019 The generalized Boltzmann distribution is the only distribution in which the Gibbs-Shannon entropy equals the thermodynamic entropy. J. Chem. Phys. 151 034113. pmid:31325924
  18. Lecomte V, Appert-Rolland C, and van Wijland F. 2007 Thermodynamic formalism for systems with Markov dynamics. J. Stat. Phys. 127 51–106.
  19. Pregowska A, Kaplan E, and Szczepanski J. 2019 How Far can Neural Correlations Reduce Uncertainty? Comparison of Information Transmission Rates for Markov and Bernoulli Processes. Internat. J. of Neur. Syst. 29 1950003. pmid:30841769
  20. Hanel R, Thurner S, and Tsallis C. 2009 Limit distributions of scale-invariant probabilistic models of correlated random variables with the q-Gaussian as an explicit example. Europ. Phys. J. B 72 263–268.
  21. Hanel R, Corominas-Murtra B, and Thurner S. 2017 Understanding frequency distributions of path-dependent processes with non-multinomial maximum entropy approaches. New J. Phys. 19 033008.
  22. Korbel J, Lindner SD, Hanel R, and Thurner S. 2021 Thermodynamics of structure-forming systems. Nature Comm. 12 1127. pmid:33602947
  23. Chomsky N. 1956 Three models for the description of language. IRE Trans. Info. Theory 2 113–124.
  24. Chomsky N and Schützenberger MP. 1963 The algebraic theory of context free languages. In Braffort P, Hirschberg D (eds.) Computer Programming and Formal Systems. Amsterdam: North Holland. 118–161.
  25. Chaitin GJ. 1974 Information-theoretic limitations of formal systems. J. of the ACM 21 403–434.
  26. Kolmogorov AN. 1963 On Tables of Random Numbers. Reprinted in Sankhya: The Indian Journal of Statistics A 25 369–376. Kolmogorov AN. 1965 Three Approaches to the Quantitative Definition of Information. Problems Inform. Trans. 1 1–7.
  27. Solomonoff R. 1964 A Formal Theory of Inductive Inference Part I & II. Information and Control 7 1–22 & 224–254.
  28. Kraft LG. 1949 A device for quantizing, grouping, and coding amplitude modulated pulses. Cambridge, MA: MS Thesis, Massachusetts Institute of Technology, hdl:1721.1/12390.
  29. McMillan B. 1956 Two inequalities implied by unique decipherability. IEEE Trans. Info. Theory 2 115–116.
  30. Corominas-Murtra B, Hanel R, and Thurner S. 2015 Understanding scaling through history-dependent processes with collapsing sample space. Proc. Nat. Acad. Sci. USA 112 5348–5353. pmid:25870294
  31. Corominas-Murtra B, Hanel R, and Thurner S. 2016 Extreme robustness of scaling in sample space reducing processes explains Zipf-law in diffusion on directed network. New J. of Phys. 18 093010.