Equivalence of information production and generalised entropies in complex processes

Complex systems with strong correlations and fat-tailed distribution functions have been argued to be incompatible with the Boltzmann-Gibbs entropy framework, and alternatives, so-called generalised entropies, were proposed and studied. Here we show that this perceived incompatibility is actually a misconception. For a broad class of processes, Boltzmann entropy, the logarithm of multiplicity, remains the valid entropy concept. However, for non-i.i.d. processes, Boltzmann entropy is not of Shannon form, −k ∑_i p_i log p_i, but takes the shape of generalised entropies. We derive this result for all processes that can be asymptotically mapped reversibly to adjoint representations in which the processes are i.i.d. In these representations the information production is given by the Shannon entropy. Over the original sampling space this yields functionals identical to generalised entropies. The problem of constructing adequate context-sensitive entropy functionals can therefore be translated into the much simpler problem of finding adjoint representations. The method provides a comprehensive framework for a statistical physics of strongly correlated systems and complex processes.


Introduction
To know the information content of a process, a system, a source, a signal, or a sequence, one uses entropy to quantify it. If systems or processes are independent identically distributed (i.i.d.), ergodic, and stationary in their probabilities, it is known what to do: one uses the expression H(p) = −∑_{i=1}^{W} p_i log p_i, Eq (1), where i = 1, 2, …, W are the states the system can take and p_i is the probability to observe them. So-called Shannon entropy is given by S(p) = kH(p), where k > 0 is a constant that specifies the units of entropy; in statistical physics k = k_B is the Boltzmann constant, and if we measure information in bits per symbol, k = 1/log 2. If the system or process of interest is not i.i.d., ergodic, or in stationary equilibrium, it becomes less clear what to do in order to obtain its correct information content. In principle there are two conceptually very different paths to solve the problem.
The first way (which we present in this paper) is to look at the information production of the process or system. Consider a process, X, that emits signals x(T) of length T with x(T) = x_T x_{T−1} ⋯ x_1. Such an ordered list/sequence of elements x(T) is often referred to as a T-tuple (or T-gram). Every symbol, x_t, in the signal is an element of a fixed sample space or alphabet, A, that contains all possible states x_t can take. For example, if you think of X as a text-producing author, states are the letters from the English alphabet, A_letters, or, once the text is stored on a computer, a "binary alphabet", A_{0,1}, of zeros and ones. In non-i.i.d. systems, symbols within sequences will in general be correlated in one way or another. Those correlations, which may extend over many different scales in the system, carry information about the system. Using the marginal distribution functions of the occurrences of states (letters) in Eq (1) will then certainly not provide the correct information content of the process. However, if we know the probability distribution, p_T, to observe entire sequences, x(T), we compute the information production of process X as I(X) = lim_{T→∞} k_T H(p_T), Eq (2); see also Text 2 in S1 File for details. This changes the perspective from individual symbols, events, or states to entire sequences, or paths. From an information-theoretic point of view, I(X) measures the average number of bits required to reversibly encode samples of X into bitstreams that can be sent through information channels; measuring information production in bits per emitted symbol means k = 1/log 2. Note that kH(p_T) measures bits per T-tuple, i.e.
per path-segment-of-length-T. Therefore, by using k_T = k/T, one measures again bits per emitted symbol of the original alphabet. However, the number of all possible T-tuples, the size of the new "alphabet", is enormous if T is large. If the sample space contains A = |A| symbols, the size of the alphabet for all paths is A^T. Summing over all these states, and knowing their probabilities, p_T, will in general be impossible. The reason why I(X) is the true information production rate of process X is that there exist representations of X in terms of other processes, Y, such that sequences, x, presented in the symbols from the initial alphabet, A_0, can be rewritten into sequences, y, using symbols from a much larger alphabet. The definition of information production asymptotically uses the largest alphabet, containing each possible sequence as a unique symbol; these symbols, as a consequence, are statistically independent. In other words, all structure (correlations) gets absorbed into new symbols belonging to an extended alphabet. Note, however, that the definition of information production uses the largest possible alphabet: the alphabet of T-tuples over the original alphabet, with T → ∞.
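The limit I(X) = lim_{T→∞} k_T H(p_T) can be probed numerically by estimating H(p_T)/T from empirical T-tuple frequencies. The following minimal sketch (the persistent binary Markov chain is an illustrative assumption, not taken from the text) shows how the per-symbol block entropy of a correlated source falls below the marginal (T = 1) estimate as T grows:

```python
import math
import random
from collections import Counter

def block_entropy_rate(x, T):
    """Empirical H(p_T)/T in bits per symbol, from non-overlapping T-tuples."""
    counts = Counter(tuple(x[i:i + T]) for i in range(0, len(x) - T + 1, T))
    n = sum(counts.values())
    H = -sum(c / n * math.log2(c / n) for c in counts.values())
    return H / T

# Toy correlated process (assumed): a binary Markov chain that repeats its
# last symbol with probability 0.9; its true entropy rate is
# -(0.9*log2(0.9) + 0.1*log2(0.1)) ~ 0.469 bits/symbol.
random.seed(1)
x = [0]
for _ in range(200_000):
    x.append(x[-1] if random.random() < 0.9 else 1 - x[-1])

r1 = block_entropy_rate(x, 1)   # marginal estimate, close to 1 bit/symbol
r8 = block_entropy_rate(x, 8)   # markedly smaller, approaching ~0.469
```

Here the marginal entropy overestimates the information production because it ignores correlations; the T = 8 estimate already lies much closer to the true rate, the residual gap being the finite-T correction.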
However, there typically exist much smaller alphabets that can capture all structures of a process or system. We say Y is adjoint to X (see subsection Constructing adjoint process spaces for details) if Y's symbols are uncorrelated and, thus, H measures the correct information content. In general, a symbol, z, in the extended alphabet encodes a number, ℓ(z), of symbols in the original alphabet. For instance, if z is a symbol in the alphabet of T-tuples, then ℓ(z) = T, z consisting of exactly T letters of the original alphabet. The average length, ℓ̄ = ⟨ℓ⟩ = ∑_z f_z ℓ(z), of original symbols emitted per symbol in the extended alphabet increases with the size of the extended alphabet; f_z = p_z(Y) is the distribution function of letters z. Consequently, the unit of information, k, adapts to the "complexity" ℓ(z), and k → k/ℓ̄.
For an example of how one can encode information about the correlations of a process X on all relevant scales, imagine an initial alphabet of Latin letters and extend it to a series of extended alphabets: one that contains syllables in addition to letters, one that adds word fragments, one that includes words, one with frequent word combinations, one with phrases, and so on. This sequence of alphabets is nested in the sense that they contain each other, A_0 ⊂ A_1 ⊂ A_2 ⊂ ⋯. With any one of these alphabets, say A_n, one can sample "text" by using the associated marginal distributions, p_λ(n), of its elements, λ. With increasing n, the resulting artificial text samples will more and more resemble the English text body from which the marginal distributions p_λ(n) were derived; see Text 1 in S1 File for the examples given by C. Shannon. We can find particular sequences of alphabets such that each alphabet, A_{n+1}, contains exactly one more symbol than A_n, by applying reversible substitutions of symbols, which we refer to as "parsing rules". For details, see Text 3 in S1 File.
The second way is to directly capture the correlations and structures of non-i.i.d. systems or processes in a generalised functional form of the entropy, which typically looks more complicated than Eq (1); this approach has so far been successful for a handful of processes. Generalised entropy functionals are usually expressed in terms of marginal distributions in the "original alphabet". These generalised entropies have been extensively studied for several decades [2–6] from different angles, generally for systems with strong or long-range correlations [7–9], that are non-ergodic or internally constrained [10–12], or for systems out of equilibrium [13–15].
For non-i.i.d. systems or processes it is essential to specify the context in which the term entropy is used, whether one talks about information theory, thermodynamics, or the maximum entropy principle (MEP) [16]. While thermodynamic aspects of such systems, especially the existence of well-defined thermodynamic potentials (and, as a consequence, temperature), are heavily debated, there is wide consensus that entropy production (the physical analogue of information production) remains a valid concept also for these systems. For thermodynamic considerations away from equilibrium and i.i.d., and for applications, e.g., to neural correlations, see for instance [17–19].
Here we will not focus on thermodynamic aspects of entropy but on the original context envisioned by Boltzmann: its power to predict a particular macro state from knowing the number of micro states (multiplicity) corresponding to it. This allows one to predict typical distribution functions and derive functional relations between macro state variables (expectation values of the respective observables). This view, which is tightly related to the MEP, is not restricted to physics and is not limited to i.i.d. processes. For specific cases the respective MEP functionals, the generalised entropy and cross-entropy, have been explicitly constructed [5,6]. In particular, if multiplicity and probabilities are multiplicatively separable in the asymptotic limit [5], a clear definition of cross-entropy is possible also for non-i.i.d. systems.
While the Boltzmann entropy concept itself remains untouched (log multiplicity), its functional form, i.e., the generalised entropy functional, depends on the context of the process, X, and the process class, F, to which it belongs. Some examples of different process classes include i.i.d. processes, exchangeable and polynomial mixture processes [20], Polya processes [21], sample space reducing processes [6], and processes describing structure-forming systems [22].
The idea behind generalised entropies is to quantify entropy as the logarithm of the number of micro states (multiplicity) of a process, X. It is based on the marginal distribution function, g_i = p_i(X), of symbols i from a sample space (here the original alphabet, A_0) that composes a functional, S(g), such that it captures the structural information in X. Similarly, one obtains generalised expressions for cross-entropy and information divergence. Those functionals effectively capture arbitrarily complicated relations between symbols of the sample space (original alphabet) in terms of marginal symbol frequencies, g, of the process X. Generalised entropies (that fulfil the first Shannon-Khinchin axiom) do not explicitly depend on system parameters that identify a process within a process class, or on other details. Obviously, constructing such a functional may in general be complicated, and it has been achieved convincingly only in a few cases, e.g., [6,21]. As we will see, one can reconstruct (or at least approximate) such functionals from data, at least in principle, since there exist fundamental limits to reconstructing generative grammars from data on the basis of statistical inference alone [23,24], a fact also captured in Chaitin's incompleteness theorem [25]. In other words, the question of whether some data, a particular sequence, x, contains regular structures that can be used to compress it may become undecidable.
Here we show that the two approaches, information production and the generalised entropy functionals, can be mapped to one another, meaning that they are the same. The diagram in Fig 1 schematically shows the basic idea: We first use the method of parsing rules to construct an adjoint representation, Y, of a given process X, and write Y = πX. Here π is a map that reversibly encodes all structures in X, such that process Y in its new extended alphabet is i.i.d. and is therefore fully described by its marginal distribution function, f_z = p_z(Y), where z is again a letter from the extended alphabet. Consequently, the Shannon information measure with the appropriate unit of information is adequate (first way). Next, we project the marginal distributions from the adjoint representation, f_z, to the original alphabet and, in a last step, by adequately lifting distribution functions over the original alphabet to distributions over the adjoint alphabet, we identify the process-specific "pull-back" information measures, S_X, S^cross_X, and D_X (precisely defined in Eq (14) below), which take distribution functions over the original alphabet as arguments, with the corresponding generalised information measures (second way). For the proof we use the minimal description length (MDL) (see also Text 2 in S1 File), the length of the shortest encoding that fully represents the data. In this context, we shall see that information production is tightly related to the notion of Kolmogorov complexity [25–27]; for a brief discussion, see Text 4 in S1 File. We explicitly demonstrate the method in an example for the class of sample space reducing (SSR) processes [21], which provide simple, analytically tractable models for driven dissipative systems and typically exhibit power law distribution functions. Their generalised entropy is exactly known for arbitrary driving rates [6,21].
The purpose of the paper is to show that, indeed, S_X(X) = k_Y H(Y) represents the generalised entropies of the second way. The proof is given constructively in the following section.

Results
The key tools used in the following are simple substitution rules, parsing rules, that allow us to reversibly re-code (possibly correlated) data streams into new symbol streams that no longer carry structure; see Text 3 in S1 File. The structure of a parsing rule we refer to as a template; a particular substitution rule derived from a template we refer to as a parsing rule.
The simplest parsing rule template, which we refer to as the elementary template, can be denoted by [r s → m], meaning that two symbols r and s that appear together are substituted with a new symbol m. In the following we will associate m also with the symbol index, i.e., it is the m-th symbol in an alphabet.
To re-code data one may choose a suitable set of parsing rule templates. We speak of a relevant set of parsing rule templates if (i) one can extract the full information content of a process, X, asymptotically, solely by using parsing rules from the set of templates, and (ii) omitting any one template from the set no longer allows one to do so. In the following we focus on processes for which the elementary parsing rule template forms a relevant set. However, the arguments presented here extend naturally to more general sets of parsing rule templates; see Text 3 in S1 File.
A generalization of the framework presented here from categorical to continuous random variables can be considered by first coarse-graining the continuous dynamics and then taking the limit to ever finer scales of coarse-graining, however at considerable measure-theoretic cost. Genuinely reformulating the theory in a framework of path integrals will mainly be limited by translating the concepts of alphabet and parsing rules, which are inherently discrete objects, into a framework of continuous variables.

Constructing adjoint process spaces
Decidability issues may forever limit our ability to design algorithms, not specific to a given context, for reconstructing optimal adjoint representations from data. On the other hand, for any finite data over finite sample spaces and a finite relevant set of parsing rule templates, one can in principle find the optimal proxy to an adjoint representation by exhaustive search, even though runtime constraints forbid such searches in practical implementations. The general procedure underlying the construction of adjoint representations always remains the same.
Suppose X is a process that emits symbols i drawn from the alphabet A_0 = {1, 2, …, W}, where W is the number of symbols. X generates data streams, x(t) = x_t x_{t−1} ⋯ x_1, where every x_t is one of the W available symbols in A_0. Consider now two letters r_1 and s_1 such that the pair r_1 s_1 ≡ x_τ x_{τ−1}, for some positions τ in the data x, has been identified to contain relevant information (e.g., because the pair is over-expressed). Then we can rewrite the pair r_1 s_1 as a new letter m_1 = W + 1, which becomes the first letter extending the alphabet, A_1 = A_0 ∪ {m_1}. We can iterate this and produce parsing rules π_n = [r_n s_n → W + n], with letter indices r_n < W + n and s_n < W + n, where π_n maps data over the alphabet A_{n−1} to data over A_n. Note that the parsing rules π_n can be uniquely inverted, i.e., we can expand data over A_n to data over A_{n−1} using the inverse map π_n^{−1}. In other words, the inverse parsing rules can be thought of as being part of a "generative grammar" [23]. We therefore can construct a sequence of maps π(n) = π_n π_{n−1} ⋯ π_1 such that data x can be mapped to representations y_n = π(n)x. At every parsing level, n, we get a corresponding distribution function of the re-coded data, p_z(y_n), with letter index z = 1, 2, …, W + n.
The Kraft and McMillan theorems [28,29] tell us that if all we know about a process are its marginal relative frequencies, g_i, at which symbols i occur, there exists a shortest reversible encoding of the data, x, of characteristic length L_min, the minimal description length (MDL), that gives the theoretically achievable minimal length of x (in units of bits). L_min is a lower bound for the true MDL, L(x), that can only be attained asymptotically; the theorems state that L_min(x) = |x| k H(g), with k = 1/log 2. As discussed above, the "complexity" ℓ(z) of a symbol is the number of letters of the original alphabet it codes for. As a consequence, we can compute the average symbol complexity for data y_{n*(t)}(t) to be ⟨ℓ⟩_{Y_{n*(t)}} ≃ |x(t)|/|y_{n*(t)}(t)|, and accordingly we get the information production of the adjoint process, k_{Y_{n*}} H(p(Y_{n*})). For simplicity, suppose there is a maximal n* that holds for all t. Since we assumed that y_{n*} is already indistinguishable from an i.i.d. process, applying another parsing rule would only compress the data without changing its MDL. This means k_{Y_{n*+1}} H(p(Y_{n*+1})) = k_{Y_{n*}} H(p(Y_{n*})). If, conversely, we look at a parsing level, n, where the adjoint process is not yet i.i.d., then we can find a parsing rule, π_{n+1}, such that k_{Y_{n+1}} H(p(Y_{n+1})) < k_{Y_n} H(p(Y_n)). Additional knowledge always reduces the attainable information production rate.
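The Kraft–McMillan bound behind L_min can be checked directly: Shannon code lengths ℓ_i = ⌈−log₂ g_i⌉ always satisfy the Kraft inequality, and their expected length lies within one bit per symbol of H(g). A small sketch (the marginal distribution g below is an assumed example):

```python
import math

def shannon_code_lengths(g):
    """Integer code lengths l_i = ceil(-log2 g_i) for marginal frequencies g."""
    return [math.ceil(-math.log2(p)) for p in g]

g = [0.5, 0.25, 0.125, 0.0625, 0.0625]        # example marginals (assumed)
lengths = shannon_code_lengths(g)

kraft = sum(2.0 ** -l for l in lengths)        # Kraft sum: must be <= 1
H = -sum(p * math.log2(p) for p in g)          # Shannon bound, bits/symbol
avg_len = sum(p * l for p, l in zip(g, lengths))
# H <= avg_len < H + 1, so the bound L_min = |x| H(g) is attainable
# up to less than one bit per symbol (exactly, for dyadic g as here).
```

For this dyadic example the code lengths are [1, 2, 3, 4, 4], the Kraft sum equals 1, and the expected length coincides with H(g) = 1.875 bits, so the Kraft–McMillan bound is tight.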
In principle, for any finite amount of data x(t), one can construct the optimal map, π(n), for the process, X, by minimizing over all possible sequences of parsing rules at any fixed parsing level n, provided we know the relevant set of parsing rule templates to consider and this set is finite. Then we can in principle find n = n*(t) such that no further reduction of the minimal description length is possible by applying any more parsing rules. In practice, an exhaustive search over all possible sequences of parsing rules is of course not feasible, even if the set of parsing rule templates consists only of the elementary template, and algorithms for inferring adjoint representations of data need to turn to different means of optimization. For theoretical considerations we may, however, assume that for a given finite relevant set of parsing rule templates and any finite t we can find the optimal map π (or one of several possible optimal maps if the map is not unique), or at least a map reasonably close to optimal, since the number of possible maps we would have to evaluate remains finite, too. Intuitively it is clear, however, that given an unknown process X for which we cannot predetermine the respective relevant set of parsing rule templates, one typically can no longer decide whether a map π is optimal or not.
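A practical stand-in for the infeasible exhaustive search is a greedy heuristic in the spirit of byte-pair encoding: at each parsing level, apply the elementary template to the currently most frequent pair. This is only a proxy for the optimal map π, but for a correlated toy source (the persistent binary chain below is an illustrative assumption) it already lowers the per-original-symbol rate k_Y H(f) = H(f)·|y|/|x| level by level:

```python
import math
import random
from collections import Counter

def shannon_bits(seq):
    """First-order Shannon entropy of a symbol sequence, in bits."""
    c, n = Counter(seq), len(seq)
    return -sum(v / n * math.log2(v / n) for v in c.values())

def merge_top_pair(seq, new_symbol):
    """One greedy parsing step [r s -> m]: merge the most frequent pair."""
    (r, s), _ = Counter(zip(seq, seq[1:])).most_common(1)[0]
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == r and seq[i + 1] == s:
            out.append(new_symbol); i += 2
        else:
            out.append(seq[i]); i += 1
    return out

random.seed(0)
x = [0]                                  # persistent binary chain (assumed)
for _ in range(50_000):
    x.append(x[-1] if random.random() < 0.9 else 1 - x[-1])

y, rates = list(x), []
for n in range(8):                       # parsing levels n = 0 .. 7
    rates.append(shannon_bits(y) * len(y) / len(x))  # k_Y H(f), bits/orig. symbol
    y = merge_top_pair(y, 2 + n)
# rates[0] is ~1 bit per original symbol; later levels fall toward the
# entropy rate of the chain (~0.469 bits/symbol).
```

Greedy merging is not guaranteed to find the optimal map, but the monotone drop of the recorded rates illustrates the statement above: as long as the re-coded stream is not yet i.i.d., another parsing rule can lower k_Y H(p(Y_n)).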
However, given an optimal π, the adjoint i.i.d. process, Y, is fully characterized by its marginal distribution, f = p(Y), over symbols in the adjoint alphabet, A* ≡ A_{n*}, and the information production, I(X) = k_Y H(f), is given by the Shannon entropy of Y. On the adjoint process space F* we can use the measures of Shannon entropy, cross-entropy, and Kullback-Leibler information divergence, given that we use the appropriate unit of information, k_Y of Eq (5). Since Y is i.i.d. over A*, the adjoint space naturally belongs to the family, F*, of all i.i.d. processes over this alphabet. Further, any Y′ in F* is fully characterized by its marginal distribution, f′, and the pair (f′, π) determines the process X′ = π^{−1}Y′. Hence, the process class, F_π, that naturally generalizes a process, X, with adjoint i.i.d. process, Y = πX, is given by F_π = π^{−1}F*. This construction completes the first part of the proof, which establishes that we can essentially measure the information production of a process as the Shannon entropy of the adjoint process. This entropy, however, uses the marginal distributions over the adjoint alphabet as arguments and can therefore not be identified directly with the generalised entropies that use the marginal distributions over the original alphabet as arguments. In the next step we will pull the information measures over the adjoint message space back to the original message space.

Information measures over extended alphabets
Suppose we have a process X with an adjoint process Y = πX. For data, x, and its adjoint sequence, y = y_n, we obtain two histograms: h_i(x), of symbols i ∈ A_0, and h_z(y), of symbols z ∈ A_n. The associated relative frequency distributions are given by g = p(x) = h(x)/|x| and f = p(y) = h(y)/|y|. Further, every symbol, z, represents a number of ℓ(z) symbols, π^{−1}z, in the original alphabet, with π = π(n). We define h̄_i(z) = h_i(π^{−1}z) as the histogram of letters i ∈ A_0 that are parsed together into the symbol z ∈ A_n. For z ≤ W, where W = |A_0|, we have h̄_i(z) = δ_iz, and h_i(x) = ∑_{z∈A_n} h̄_i(z) h_z(y) needs to hold for all i ∈ A_0. This provides us with constraints that we need in the next section. As a consequence, we have g_i = (1/⟨ℓ⟩_f) ∑_{z∈A_n} h̄_i(z) f_z, Eq (6). We drop the arguments (x) and (y) (or (y_n)) from now on and distinguish histograms by their index i (over A_0) and z (over the adjoint alphabet A*). This means that we can write the constraints that link distributions g over A_0, Eq (6), with distributions f over A* in the form g_i ⟨ℓ⟩_f = ∑_{z∈A*} h̄_i(z) f_z, Eq (8). As mentioned before, the process class, F_π, that X belongs to is completely determined by the map π, and the process X by the pair (f, π); see Fig 1. Therefore, we can identify the entropy of X with S_π(f) = k_Y H(f), with the process-specific Boltzmann constant, k_Y ≡ k/⟨ℓ⟩_f. For processes, X and X′, and with f′ = p(πX′), the cross-entropy and the information divergence are S^cross_π(f′‖f) = k_Y H^cross(f′‖f) and D_π(f′‖f) = k_Y D_KL(f′‖f). In the special case where X is already an i.i.d. process, no features can be extracted from the data and n = 0, π = π(0) = id, and ℓ(z) = 1 for all z = 1, …, W. Consequently, S_π = kH, S^cross_π = kH^cross, and D_π = kD_KL (Kullback-Leibler divergence), as required.
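The constraint h_i(x) = ∑_z h̄_i(z) h_z(y) can be verified mechanically on a toy parsing; the two rules below, [1 1 → 3] and [2 2 → 4], and the test sequence are illustrative assumptions:

```python
from collections import Counter

rules = {3: (1, 1), 4: (2, 2)}     # assumed parsing rules [1 1 -> 3], [2 2 -> 4]

def expand(z):
    """Recursively expand an adjoint symbol back to original letters."""
    if z not in rules:
        return [z]
    r, s = rules[z]
    return expand(r) + expand(s)

def parse(seq, r, s, m):
    """Replace non-overlapping occurrences of the pair (r, s) by m."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == r and seq[i + 1] == s:
            out.append(m); i += 2
        else:
            out.append(seq[i]); i += 1
    return out

x = [1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 1, 2]
y = parse(parse(x, 1, 1, 3), 2, 2, 4)          # adjoint sequence over A_n

h_x = Counter(x)                               # h_i(x) over A_0
h_y = Counter(y)                               # h_z(y) over A_n
# Check: h_i(x) = sum_z hbar_i(z) h_z(y), with hbar_i(z) = count of i in z
for i in (1, 2):
    assert h_x[i] == sum(expand(z).count(i) * n for z, n in h_y.items())
```

Dividing the same identity by |x| = |y|⟨ℓ⟩ yields the relative-frequency form of the constraint, Eq (8).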

Pulling back entropies to the original alphabet
In the next step one can construct entropy functionals over the original alphabet, A_0, by lifting a distribution function, g′, on A_0 to a distribution function, f′, over A*, assuming that f = p(πX) is the true distribution function of the process Y = πX. We proceed by minimizing the information divergence, D_π(f′‖f), with respect to f′. More precisely, we minimize the functional ψ(f′, α, η), given by D_π(f′‖f) augmented with the terms for the Lagrange multipliers, α and η_i, that normalize f′ and guarantee the constraints of Eq (8). Solving the variational principle δψ = 0 estimates f′ at the minimum that is compatible with g′. We identify this minimizer as f̂(g′) and obtain α = 1/⟨ℓ⟩_f, together with an implicit equation for f̂(g′) that has to be solved self-consistently. If f already meets all matching constraints with g′, i.e.,
if g′ = g with g = p(X), then we have f̂(g) = f. This means that one can lift marginal distributions, g′, on A_0 to distributions, f̂(g′), on A_n with respect to a particular process, X. As a consequence one can pull the entropy, cross-entropy, and divergence back from distributions, f, over the adjoint sample space, A_n, to distributions, g, over the initial alphabet, A_0, with respect to a particular process, X ∈ F_π. In particular, one can define the projection operator, π_*, through g = p(π^{−1}y) ≡ π_* p(y) = π_* f, and the operator, π^*_X, that lifts distributions, g, over the alphabet, A_0, to distributions, f, over the extended alphabet, A*, through π^*_X g ≡ argmin_{f′ | g = π_* f′} D_π(f′‖p(πX)), (13), with respect to the process X, i.e., with respect to the distribution function, p(πX), of the i.i.d. adjoint process Y = πX. This defines the pull-back measures S_X(g) ≡ S_π(π^*_X g), S^cross_X(g′‖g) ≡ S^cross_π(π^*_X g′‖π^*_X g), and D_X(g′‖g) ≡ D_π(π^*_X g′‖π^*_X g), (14),
where, typically, g = p(X). We call these measures the pull-back entropy, cross-entropy, and information divergence of the process X. Note that while S_π, S^cross_π, and D_π are universal on the entire class of processes F_π, pulling the information measures back to marginal distributions g over A_0 yields information measures that are specific to a particular process X ∈ F_π.

Generalised entropies over initial alphabets
The final question is how the pull-back measures S_X, S^cross_X, and D_X, defined in Eq (14), are related to generalised entropy functionals as derived, for example, in [6,21]. There, functionals were derived to obtain the most likely histogram, h, observed in a given process after t observations (for large t). Even for non-i.i.d. processes, the probability, P(h|θ), to observe the particular histogram, h, for t = ∑_{i∈Ω} h_i observations often factorizes as P(h|θ) = M(h)G(h|θ) into a multiplicity, M(h), and a probability term of the sequences, G(h|θ). θ is a set of parameters that determines the process, X; it defines and parametrizes the process class X belongs to. Whenever such a factorization is possible, one can show that a generalised maximum entropy principle exists. Using the Boltzmann definition of entropy, the logarithm of multiplicity, S = (log M)/t, and defining S^cross = −(log G)/t, and a generalised information divergence as D(g|θ) = −t^{−1} log P(h|θ), where g = h/t, the standard relation D = S^cross − S remains valid [5].
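For an i.i.d. process the factorization P(h|θ) = M(h)G(h|θ) is explicit: M(h) is the multinomial coefficient and G(h|θ) = ∏_i θ_i^{h_i}. A short numerical check (the frequencies g are an assumed example) confirms that S = (log M)/t converges to the Shannon entropy of g = h/t as t grows, with a correction of order log(t)/t coming from Stirling's formula:

```python
import math

def log_multiplicity(h):
    """log M(h) for the multinomial coefficient t! / prod_i h_i!."""
    t = sum(h)
    return math.lgamma(t + 1) - sum(math.lgamma(k + 1) for k in h)

g = [0.5, 0.3, 0.2]                  # assumed relative frequencies
H = -sum(p * math.log(p) for p in g) # Shannon entropy of g, in nats
for t in (100, 10_000, 1_000_000):
    h = [round(t * p) for p in g]    # histogram with t observations
    S = log_multiplicity(h) / t      # Boltzmann entropy (log M)/t
    # S approaches H as t grows; the gap shrinks like log(t)/t
```

After the loop, S holds the t = 10⁶ value, which agrees with H to better than one part in a thousand, while the t = 100 value still shows a visible Stirling correction.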
If we are looking at a family of processes X(θ) rather than a single process X, then we can no longer assume a priori that the map π^(1) that efficiently takes a process X(θ^(1)) to an adjoint representation is the same map π^(2) that does so for another process X(θ^(2)) from the family. However, there are ways in which one can construct a single map π to some extended alphabet A* such that all processes X(θ) decorrelate under the same map π. In the weakest version we can start with some extended alphabet A*, the space of all i.i.d. processes F* over that alphabet, and a map π, and consider the process family F_π = π^{−1}F*, in which case the natural parameters of the processes X(θ) are given by θ = f, where f are distribution functions over A*. For a discussion of stronger versions see Text 6 in S1 File.
If the above assumption is valid, then X(θ) forms a sub-class of processes in F_π with distribution functions, g(θ) = p(X(θ)), and adjoint distribution functions, f(θ) = p(πX(θ)). Similarly, we have ℓ̄(θ) = ∑_z f_z(θ) ℓ(z), y′ = πx′, and t = |x′| = |y′| ℓ̄(θ). In the next step we note that the probability of observing a histogram h′ under process X(θ) is given by Eq (15), with ĥ* being the respective maximizing argument h*, the histogram over the extended space.
The expression h′ = h̄ h* encodes the constraints between the histograms over the original and the adjoint alphabet. Since y is sampled i.i.d., the last line of Eq (15) contains the multinomial distribution of histograms h*. Also note that in the second-to-last line of Eq (15), ≃ stands for asymptotically equivalent, meaning that the relative error we make in log P(h′|θ) by replacing P(h′|θ) with P(h*|f(θ)) vanishes in probability as t → ∞. For more details see Text 7 in S1 File. We note that this asymptotic equivalence is a consequence of the fact that processes that can be mapped into an adjoint representation become i.i.d. there, and can therefore implicitly be thought of as being ergodic on sufficiently large time scales. Taking logs, multiplying both sides with −k/t, and setting g′ = h′/t and f* = ĥ*/|ĥ*|, we then obtain D(g′|θ) ≃ min_{f′ | g′ = π_* f′} D_π(f′‖f(θ)) = D_π(π^*_{X(θ)} g′‖f(θ)) = D_{X(θ)}(g′‖g(θ)), where ≃ again means asymptotically identical for large t; from the second to the third expression we used the definition of the lift operator from Eq (13) and of D_X from Eq (14). In other words, we have shown that in the limit t → ∞ the generalised information divergence D(g′|θ) is identical to the pull-back divergence D_{X(θ)}(g′‖g(θ)) as a functional. As a consequence we can use D_X = S^cross_X − S_X and identify S_X with the generalised entropy and S^cross_X with the generalised cross-entropy. We see that the generalised entropy, S, for the process family, X(θ), is given by S(g) = S_{X(θ)}(g).
We conclude that for all process classes (at least those that decorrelate over a common map π) there exist notions of entropy, S_X, cross-entropy, S^cross_X, and divergence, D_X, that behave in the usual way, namely, D_X = S^cross_X − S_X. This means that for such processes (at least asymptotically) the probability to observe a particular histogram, P ≃ exp(−tD_X), factorizes into a multiplicity term, M ≃ exp(tS_X), associated with entropy, and a sequence probability term, G ≃ exp(−tS^cross_X), associated with cross-entropy. In other words, we have shown that Boltzmann's entropy, the logarithm of multiplicity, remains the correct estimator of information production. For complex systems the multiplicity will differ from a multinomial coefficient in arbitrarily complicated ways and might even depend explicitly on system parameters θ, which would mean a violation of the first Shannon-Khinchin axiom (SK1). In other words, the Boltzmann pull-back entropy functional will typically be more complicated than the Shannon entropy functional and can even violate SK1; yet these functionals are the appropriate generalizations of entropy in terms of information production. We also learned that beyond such parametric families of generalised entropies, i.e., beyond the pull-back measures, the standard notions of entropy, cross-entropy, and divergence are present in the adjoint alphabet, where the only thing that is not universal about the information measures is the process-specific Boltzmann constant, k_Y, that needs to be used.

Example: SSR processes
We now demonstrate explicitly that the generalised entropy that, according to the previous section, is identified with the pull-back entropy, S_X, indeed measures information production. We do that by considering slowly driven sample space reducing (SSR) processes, X, for which the generalised entropy functional is exactly known [21]. SSR processes are models of driven non-equilibrium systems. They are characterized by the fact that, as the process unfolds, the number of states accessible to the process reduces when no driving is present [30]. In its simplest form, the process relaxes to a ground state from which it has to be restarted. One can think of the process as a ball bouncing down a staircase with random jump sizes. The ball can only jump to steps lower than the last step it visited. Once it reaches the bottom of the staircase, one lifts the ball to the top of the staircase (driving) and kicks it down the staircase again. The stairs represent (energy) states, the lowest being 1, the highest W. The process exhibits path dependence in its relaxation part; the current through the system breaks detailed balance between states. Such processes exhibit a Zipf law in their distribution function.
The micro states, x, of the SSR process are sequences of states with elements x_n ∈ {1, 2, …, W} ≡ Ω. The transition probabilities between states, j → i, are p(i|j) = Θ(j − i) q_i/Q_{j−1} + δ_{1j} q_i, where the first term on the right-hand side describes the relaxing part of the SSR process (transitions only happen from higher states, j, to lower states, i, i.e., when i < j), with prior distribution q_i and cumulative distribution Q_j = ∑_{i=1}^{j} q_i; Θ is the Heaviside function. The second term captures the (slow) driving of the process. Slow here means that the system is only driven once the SSR process reaches its lowest position, i = 1. SSR processes are Markovian, since transition probabilities depend only on the current position, and ergodic, since after the relaxation the system is reset to any state with probability q_i.
To understand the statistics of the process we are interested in the distribution of visits to the individual states. We define the macro state to be the histogram, h_i, of visits of X to state i. It is possible to compute S_SSR(p) = (1/t) log M(h), where the multiplicity, M(h), is the number of different sequences, x, of length t with the same histogram h; the explicit functional form is given in [21], where p_i = h_i/t are the relative frequencies of observing state i. Note that this is the Boltzmann entropy of the system, yet it is not of Shannon form, since it is derived from a Markov, and not an independent, sampling process. Similarly, one finds the cross-entropy, and by maximizing ψ = S_SSR − S^cross_SSR (the negative information divergence, −D), one obtains the characteristic Zipf distribution of the slowly driven SSR process [30], p_i ∝ q_i/Q_i. For the special case of q_i = 1/W for all i, the Zipf distribution, p_i ∝ 1/i, is obvious, since Q_i = i/W and q_i/Q_i = 1/i. It continues to hold for "well-behaved" q_i [31]. For instance, if q_i ∝ i^α for α > −1, then Q_i ∝ i^{1+α} and again q_i/Q_i ∝ 1/i.
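The Zipf law p_i ∝ q_i/Q_i can be reproduced by direct simulation. The sketch below assumes uniform priors q_i = 1/W and a simplified driving that always restarts the process at the top step W (rather than sampling the restart from q), so the check p_i ∝ 1/i applies to the states i < W:

```python
import random
from collections import Counter

def ssr_visits(W, n_relax, seed=7):
    """Histogram of state visits for a slowly driven SSR staircase."""
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(n_relax):
        i = W                            # driving: restart at the top step
        visits[i] += 1
        while i > 1:                     # relaxation: only lower steps reachable
            i = rng.randint(1, i - 1)    # uniform priors q_i = 1/W (assumed)
            visits[i] += 1
    return visits

W, n = 100, 200_000
v = ssr_visits(W, n)
total = sum(v.values())
# Zipf check: i * p_i should be roughly constant for i < W
# (the restart state i = W is over-represented by the simplified driving).
ratios = [i * v[i] / total for i in (1, 2, 10, 50)]
```

The computed ratios agree within a few percent, i.e., the visit distribution follows p_i ∝ 1/i for the relaxing part of the staircase, as predicted by maximizing S_SSR − S^cross_SSR.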
In the next step we use a simple example of a slowly driven SSR process to demonstrate how extended alphabets work and why the respective generalised entropy functional is the adequate measure of information production.
Example of a small SSR system. To demonstrate what a minimal adjoint alphabet for a slowly driven SSR process looks like, consider such a process over an initial alphabet of W = 4 symbols (numbers) representing the four states, i ∈ A_0 = {1, 2, 3, 4}. An SSR sequence in that alphabet might look like x = 421214321431212141…. Remember that the q_i are normalized weights such that the probability to sample the state j < i, conditional on the process being in state i, is given by q_j/Q_{i−1}, with Q_i = ∑_{j=1}^{i} q_j, and by q_j if the system is in the ground state i = 1 and gets driven. One can think of the adjoint SSR alphabet, A_{n*}, as the union of A_0 with the set of new symbols that represent all possible strictly monotonically decreasing sequences on A_0 that end in the ground state, i.e., A_{n*} = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}, where the new symbol "5" represents the sequence 21, "6" stands for 31, "7" for 321, "8" for 41, "9" for 421, "10" for 431, and "11" for 4321. Since the n* = 7 new symbols extend the original alphabet {1, 2, 3, 4}, we have a total of 11 symbols in the extended alphabet A_{n*}. From the n* = 7 parsing rules producing the new symbols, the map π = π_7 π_6 π_5 π_4 π_3 π_2 π_1, which maps between messages written in the initial and the adjoint alphabet, can be constructed. We can therefore rewrite our example as π^(7)x = 9 5 11 10 5 5 8 … (multi-letter adjoint symbols separated by spaces). We now project a distribution function, f, on A_{n*} to a distribution function, g, on A_0. Remembering that the letters 5 to 11 code for the subsequences π^{−1}5 = 21, π^{−1}6 = 31, π^{−1}7 = 321, π^{−1}8 = 41, π^{−1}9 = 421, π^{−1}10 = 431, and π^{−1}11 = 4321, we see that all the new letters 5 to 11 represent sequence-chunks that contain a 1; that is, ĥ_1(z) = 1 for every z ∈ {1, 5, 6, …, 11} and ĥ_1(z) = 0 otherwise, where ĥ_i(z) counts how often the original letter i occurs in the chunk coded by z. Letter 2 is part of the sequence-chunks represented by the extended letters 5, 7, 9, and 11; that is, ĥ_2(z) = 1 for z ∈ {2, 5, 7, 9, 11} and 0 otherwise. Similarly we can find ĥ_i(z) for i = 3 and i = 4. As a consequence the distribution functions g_i of the
message x in original letters i = 1, …, 4 and the distribution function f_z of the adjoint message π^(7)x in extended letters z = 1, 2, …, 11 are related by the four equations g_i = (1/Z) ∑_{z ∈ A_{n*}} f_z ĥ_i(z), i = 1, …, 4, where Z is a normalization constant such that 1 = ∑_{i=1}^{4} g_i. Note that applying π to an SSR process yields (asymptotically) f_2 = f_3 = f_4 = 0. We can now express the asymptotic relative frequencies, i.e., the probabilities f_z, in terms of the weights q_i on the SSR states i = 1, 2, 3, 4, and get f_5 = q_2, f_6 = q_3 q_1/(q_1 + q_2), f_7 = q_3 q_2/(q_1 + q_2), f_8 = q_4 q_1/(q_1 + q_2 + q_3), and so forth. Inserting the expressions for f_z in Eq (23), one self-consistently obtains the marginal distribution on the original alphabet as predicted from Eq (21); note that if q_i = 1/W is uniform, then the solution q_i/Q_i = 1/i exactly reproduces Zipf's law, and for a broad variety of choices of q_i one obtains approximate Zipf laws. That means that f fulfils the matching constraints of Eq (8) exactly, and therefore the lift, p*_X, of the asymptotic marginal distribution function, g, to f is also exact and given by f = p*_X g. That is, this simple example shows how the distribution function g over the original alphabet can be predicted from knowing the distribution function f of the letters of the extended alphabet.
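The whole example can be verified end to end in a few lines. The sketch below (our illustration; the symbol numbering 5–11 follows the text, the function names are ours) parses the example sequence into adjoint symbols, computes the asymptotic chunk probabilities f_z from the weights q, and projects them back to the marginal g, recovering the Zipf ratios g_i ∝ 1/i for uniform q:

```python
from itertools import combinations

# Adjoint symbols for W = 4: each relaxation chunk (a strictly decreasing
# run ending at the ground state 1) is replaced by one extended letter.
SYMBOL = {(1,): 1, (2, 1): 5, (3, 1): 6, (3, 2, 1): 7, (4, 1): 8,
          (4, 2, 1): 9, (4, 3, 1): 10, (4, 3, 2, 1): 11}

def parse(x):
    """Cut x into relaxation chunks and map each chunk to its adjoint symbol."""
    out, chunk = [], []
    for s in x:
        chunk.append(s)
        if s == 1:                    # cascade reached the ground state
            out.append(SYMBOL[tuple(chunk)])
            chunk = []
    return out

x = [4, 2, 1, 2, 1, 4, 3, 2, 1, 4, 3, 1, 2, 1, 2, 1, 4, 1]  # example from text
print(parse(x))  # → [9, 5, 11, 10, 5, 5, 8]

def chunk_distribution(q):
    """Asymptotic probabilities f_z of the relaxation chunks."""
    W = len(q)
    Q = [sum(q[:i]) for i in range(W + 1)]   # Q[i] = q_1 + ... + q_i
    f = {(1,): q[0]}                         # driving lands on the ground state
    for r in range(1, W):
        for S in combinations(range(2, W + 1), r):
            chunk = tuple(sorted(S, reverse=True)) + (1,)
            p = q[chunk[0] - 1]              # driving lands on top of the chunk
            for a, b in zip(chunk, chunk[1:]):
                p *= q[b - 1] / Q[a - 1]     # relaxation step i -> j < i
            f[chunk] = p
    return f

def project(f):
    """Marginal g_i: expected counts of original letters per chunk, normalized."""
    counts = {}
    for chunk, p in f.items():
        for letter in chunk:
            counts[letter] = counts.get(letter, 0.0) + p
    Z = sum(counts.values())
    return {i: c / Z for i, c in counts.items()}

f = chunk_distribution([0.25] * 4)           # uniform q_i = 1/W
g = project(f)
print(round(f[(2, 1)], 4))                   # f_5 = q_2 = 0.25, as in the text
print([round(g[i] / g[1], 4) for i in (1, 2, 3, 4)])  # Zipf: 1 : 1/2 : 1/3 : 1/4
```

The parsed sequence reproduces π^(7)x from the text, and for uniform q the projected marginal satisfies g_i ∝ 1/i exactly, matching the prediction q_i/Q_i of Eq (21).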
Since slowly driven SSR processes are in fact Markov processes, i.e., processes where the probability to sample the state at t + 1 depends only on the state of the system at the previous time-step t (compare the transition probabilities of Eq (18)), one can also prove that the respective generalised entropies are the adequate information measures in this context. It is well known that the information production of a Markov process can be measured by the so-called conditional entropy, S_cond. This is a functional that depends on the probabilities p^(2) = p_ij that a symbol j follows symbol i in the process. The SSR entropy, on the other hand, depends on p^(1) = p_i, the marginal distribution of the symbols i. If p^(2) is the maximizer of the conditional entropy, or more precisely, the minimizer of the conditional information divergence, and p^(1) is the minimizer of the SSR information divergence, then both estimators of entropy, the conditional entropy and the SSR entropy, are identical, S_cond(p^(2)) ≡ S_SSR(p^(1)), for all choices of the system parameters q. For details of the computation, see Text 5 in S1 File.
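This identity of information measures can be illustrated numerically. The sketch below (our construction, assuming uniform q_i = 1/W, W = 4, and measuring in bits) computes the conditional entropy S_cond from the transition probabilities and, since the explicit SSR functional is given in [21], compares it instead with the Shannon entropy of the adjoint chunk distribution f per original symbol, which the paper identifies with the information production; the two values agree:

```python
import math
from itertools import combinations

W = 4
q = [1.0 / W] * W                                # uniform priors
Q = [sum(q[:i]) for i in range(W + 1)]           # Q[i] = q_1 + ... + q_i

# Stationary marginal distribution: the Zipf solution p_i ∝ q_i/Q_i.
p = [q[i] / Q[i + 1] for i in range(W)]
Z = sum(p)
p = [v / Z for v in p]

def transition_row(i):
    """Transition probabilities out of state i (1-based): relaxation
    q_j/Q_{i-1} for j < i, or the driving q_j from the ground state."""
    if i == 1:
        return q[:]
    return [q[j] / Q[i - 1] for j in range(i - 1)] + [0.0] * (W - i + 1)

# Conditional entropy of the Markov chain, in bits per symbol.
S_cond = -sum(p[i - 1] * sum(t * math.log2(t) for t in transition_row(i) if t > 0)
              for i in range(1, W + 1))

# Shannon entropy of the adjoint (chunk) alphabet, per original symbol.
f = {(1,): q[0]}
for r in range(1, W):
    for S in combinations(range(2, W + 1), r):
        chunk = tuple(sorted(S, reverse=True)) + (1,)
        pr = q[chunk[0] - 1]
        for a, b in zip(chunk, chunk[1:]):
            pr *= q[b - 1] / Q[a - 1]
        f[chunk] = pr
H_f = -sum(v * math.log2(v) for v in f.values())
mean_len = sum(len(c) * v for c, v in f.items())

print(round(S_cond, 6), round(H_f / mean_len, 6))  # the two values coincide
```

The agreement reflects the fact that the adjoint process is i.i.d. over chunks, so its Shannon entropy per original symbol equals the entropy rate of the underlying Markov chain.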

Discussion
We showed that by identifying the entropy of a process with its information production it is possible to consistently extend the fundamental notion of entropy in statistical physics, Boltzmann entropy, to non-i.i.d. processes and processes that operate out of equilibrium. This is done by identifying isomorphisms that map entire process classes to adjoint representations where processes are i.i.d. The sample space (or alphabet) of the adjoint process is typically much larger than the sample space of the original process. The isomorphisms can be thought of as concatenations of parsing rules that map strongly correlated segments of the original process to new symbols. Information production of the adjoint i.i.d. process is quantified by Shannon entropy. Pulling back the entropy measure from the adjoint space to the original sample space and comparing the resulting functional with the Boltzmann entropy (the process-specific log multiplicity) establishes the asymptotic equivalence of the notions of generalised entropy and information production.
This provides a comprehensive picture that consistently links information theory and the statistical physics of categorical non-i.i.d. processes in a context-sensitive way, allowing us to consistently associate a notion of entropy, a cross-entropy (representing the constraints of the maximum entropy principle), and an information divergence (or relative entropy) with complex processes. Context-sensitive means that the functional form of the entropy depends on the class of processes considered; the concept of entropy itself, information production from the information-theoretic perspective and Boltzmann entropy from the physics perspective, remains untouched.
If an adjoint representation of one process is found, one can find the adjoint representations of an entire class of processes that all de-correlate in their representations over the same adjoint sample space. This means that there exists a natural way in which processes implicitly define their own generalization to an entire process class. This is possible because the property of de-correlating over the same adjoint sample space implements an equivalence relation. This has important consequences, since these equivalence classes of processes generalize the idea of an ensemble to non-i.i.d. processes, which provides a concise way, grounded in first principles, to extend the program of statistical physics to complex processes and their macro variables.

Fig 1. Diagram of the relations between distribution functions and entropies over the sample space and the adjoint sample space. We consider a process X over alphabet A with adjoint process Y = πX over the adjoint alphabet A*. Y is i.i.d. and therefore fully characterized by the marginal distribution of the letters it samples from, i.e., asymptotically the data y = πx is fully characterized by the relative frequency distribution function f = p(y). F* is the set of all i.i.d. processes over A*. Therefore F_π = π^{−1}F* is the class of processes that naturally generalize X ∈ F_π. f can be projected to the marginal distribution function g = p(x) = π_* p(y) = π_* f. Conversely, for a particular process, X, we can lift the distribution function g to the associated adjoint distribution and get p*_X g = p*_X p(x) = p(y) = f. Since Y is i.i.d. over the adjoint sample space, one can measure information production simply by using Shannon entropy with the adequate Boltzmann factor k_Y (adapted to the distribution function f). The commutative diagram therefore defines the process-specific generalised entropy S_X over the sample space of the process class F_π = π^{−1}F*.
https://doi.org/10.1371/journal.pone.0290695.g001

Suppose now that for every t we find a parsing level n*(t) such that y_{n*(t)}(t) is a representation of the data x(t) that cannot be distinguished from an i.i.d. process (which we indicate here by the *). It follows that y_{n*(t)}(t) is entirely determined by its marginal distribution of letters, p_z(y_{n*(t)}(t)), and we obtain asymptotically kH(p(y_{n*(t)}(t))) ≃ L_min(y_{n*(t)}(t))/|y_{n*(t)}(t)|, where ≃ means asymptotically identical. With |.| we indicate the length of a sequence in numbers of letters of the underlying alphabet; for instance, |x(t)| = t and |y_{n+1}(t)| ≤ |y_n(t)|. For data x(t) of length t we find a sequence of representations y_n(t) = π(n)x(t), and we can asymptotically measure the information production of the process X as (1/t)L(x(t)), Eq (3), with k = 1/log(b), where b is the basis in which information is measured; for bits we typically have b = 2. kH(g) is the MDL in bits per symbol and L_min = t kH(g) is the minimal number of bits required to encode messages of length t. Let ⟨ℓ⟩_f (in this notation we identify ⟨ℓ⟩_f ≡ ⟨ℓ⟩_Y) and ⟨ĥ_i⟩_f = ∑_{z∈A*} f_z ĥ_i(z) be expectation values under the distribution f; then, by construction, |x| = ∑_{z∈A*} h_z ℓ(z) for the adjoint process, πX, over the adjoint alphabet. The lift operator gives us the minimizer f(g) = p*_X g, and we find that id = π_* p*_X.