
The entropy rate of Linear Additive Markov Processes

Abstract

This work derives a theoretical value for the entropy of a Linear Additive Markov Process (LAMP), an expressive but simple model able to generate sequences with a given autocorrelation structure. Our research establishes that the theoretical entropy rate of a LAMP model is equivalent to the theoretical entropy rate of the underlying first-order Markov Chain. The LAMP model captures complex relationships and long-range dependencies in data with similar expressiveness to a higher-order Markov process. While the parameter space of a higher-order Markov process grows exponentially with its order, a LAMP model is characterised only by a probability distribution and the transition matrix of an underlying first-order Markov Chain. This surprising result can be explained by the information balance between the additional structure imposed by the next-state distribution of the LAMP model and the additional randomness of each new transition. Understanding the entropy of the LAMP model provides a tool to model complex dependencies in data while retaining useful theoretical results. To emphasise the practical applications, we use the LAMP model to estimate the entropy rate of the Lastfm, BrightKite, Wikispeedia and Reuters-21578 datasets. We compare estimates calculated using frequency probability estimates, a first-order Markov model and the LAMP model, also considering two approaches to ensure the transition matrix is irreducible. In most cases the LAMP entropy rates are lower than those of the alternatives, suggesting that the LAMP model is better at accommodating structural dependencies in the processes, achieving a more accurate estimate of the true entropy.

Introduction

This work derives a theoretical value for the entropy of a Linear Additive Markov Process (LAMP), an expressive but simple model with a more configurable autocorrelation structure than an equivalently-simple Markov process.

Markov processes are simple models with wide-ranging applications. They rely on the Markov property: the next state depends only on the current state, and is conditionally independent of the rest of the process's history. This simplicity allows theoretical results to be derived and understood easily, and the flexibility of the framework makes it a useful tool that is easily extended; consequently, Markov processes have been used in a vast number of applications.

However, many real sequences have correlations with a longer or more complex history than the simple Markov property allows [15]. It is not uncommon to find correlations in real data that are so strong that the process is considered to have long-range dependence, which is defined by the tail of the autocorrelation function decaying so slowly that sums over the tail diverge. Other processes with only short-range correlations may still exhibit complex structure of interest.

Markov processes do have a correlation structure, and this can even be long-range dependent for infinite-state Markov chains [6], but the autocorrelation structure of a Markov process is difficult to tune, reducing their usefulness in applications that require control over the autocorrelation structure.

Higher-order Markov processes allow the next transition to depend on the previous m states, where m is the order of the process, and thus allow longer correlations to be built into the process. For an mth-order Markov process with n possible states, modelling these transitions requires (n − 1)n^m parameters, so higher-order models are prone to overfitting as m increases.
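To make this growth concrete, the parameter count above can be tabulated directly; a minimal sketch (the function name is ours, not from the paper):

```python
def markov_param_count(n, m):
    """Free parameters of an mth-order Markov process on n states:
    n**m possible length-m histories, each carrying a distribution over
    n next states with n - 1 free entries."""
    return (n - 1) * n ** m

# For a modest alphabet of n = 10 states, the count grows exponentially in m:
counts = [markov_param_count(10, m) for m in (1, 2, 3)]  # 90, 900, 9000
```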

The LAMP model, proposed by Kumar et al. [7], overcomes these challenges. The LAMP can fit a measured autocorrelation structure even for sequences that exhibit long-range dependence. An expressive model, the LAMP has been shown to have performance comparable to deep sequential models despite its relatively small parameter space [7]. While LAMPs are not as expressive as higher-order Markov processes, their straightforward theoretical results and explainability make them a good model for time-series data where the measured correlation structure is of interest.

In this work, we prove that the entropy rate of a LAMP is equal to that of the underlying first-order Markov Chain. The result is surprising because the structure introduced by the LAMP plays no part in entropy rate, despite having a strong impact on the structure of the process, for instance, through the autocorrelation. This foundational result opens up the possibilities for the application of the LAMP model in information-theoretic settings.

A consequence of this result is that the LAMP is able to distinguish between sequences that might look very different due to the presence of long-range dependency (LRD), yet have the same entropy rate—providing a further useful application of the model. We demonstrate the application of the LAMP model in calculating entropy rate estimates by applying it to four standard datasets, and comparing these estimates with other existing methods.

In summary, the contributions of the paper are:

  1. A closed form solution giving the entropy rate of a LAMP process, and showing that the entropy rate is exactly that of the underlying Markov chain regardless of the autocorrelation structure.
  2. Use of this result to construct an estimator which is applied to four publicly available datasets to obtain entropy estimates, close to or lower than values obtained using a first-order Markov model. Estimated entropy values are 4.88 bits/symbol for the Lastfm dataset, 2.49 bits/symbol for the BrightKite dataset, 3.18 bits/symbol for the Wikispeedia dataset and 3.61 bits/symbol for the Reuters dataset.
  3. An exploration of the impacts of forcing ergodicity in a transition matrix when calculating LAMP entropy estimates. Here, we consider two approaches, using only the largest connected component and adding an artificial fully connected node, with entropy estimates similar between each approach. Analysis of the stability of the estimate against the artificial transition probability was undertaken, and a robust value of p_artificial = 2^-15 is recommended.

Background

We start by defining a standard discrete-time, time-homogeneous Markov chain Xt, a process which underlies the LAMP model. The definitions and notation are standard, but we include them to ensure there is no confusion.

Let S be a state space, which contains n states. Without loss of generality, we label the states S = [n] = {1, 2, …, n}. We take P to be the n × n stochastic matrix which defines the transitions between states for the underlying first-order Markov process. It follows that the row sums of P are all equal to 1, and each element is non-negative.

In a first-order Markov process, the probability of transitioning from state x_{t−1} to x_t is given by P(X_t = x_t | X_{t−1} = x_{t−1}) = P_{x_{t−1} x_t}, by the Markov property. The stationary distribution of the Markov process is denoted by the vector π, which is the left eigenvector of P corresponding to the eigenvalue 1. That is, π satisfies the equation πP = π.

If P defines an ergodic Markov process, this stationary distribution is unique.
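As an illustrative sketch (not part of the paper's code), the stationary distribution of a small transition matrix can be computed numerically as the left eigenvector of P for eigenvalue 1:

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P for eigenvalue 1, normalised to sum to 1."""
    vals, vecs = np.linalg.eig(P.T)
    # Pick the (right) eigenvector of P.T whose eigenvalue is closest to 1.
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return pi / pi.sum()

# A small two-state ergodic chain.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = stationary_distribution(P)
# pi satisfies pi @ P = pi; for this chain pi = [2/3, 1/3].
```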

The above results can be generalised to countable state processes, however, a finite-state Markov chain cannot reproduce long-range dependency structures. LAMPs afford control over the correlations within the process without requiring an infinite state space.

Linear Additive Markov Process (LAMP)

A Linear Additive Markov Process (LAMP) is defined using a first-order Markov process and a probability distribution w on the positive integers that we refer to as the LAMP kernel. The LAMP structure is similar to the alternative higher-order Markov model proposed by [8]. Assuming we are at time n − 1 and wish to calculate the next state Xn we adopt the following two stage process:

  1. Select a past state x_{n−q} with probability w_q. This state is also referred to as the transition state, X_{n−q}.
  2. Determine the state X_n = x_n according to the transition probabilities P_{x_{n−q} x_n}.

That is, we use the same state transition probabilities as in the underlying Markov chain, but the historical state on which the transition is based is randomly chosen according to w.

In practice, there are some extra details: (i) the distribution w is typically assumed to have finite support [k] = {1, 2, …, k} but it need not be finite, and (ii) the state that is chosen must be from the existing history, which may not extend k steps initially, and so the state chosen as the jumping off point is masked according to x_{max(0, n−q)}. Depending on the distribution w, this may result in an over-representation of x_0 at initial stages of the sequence. Using appropriate burn-in times and sequence lengths during simulation can mitigate this issue. The value of k gives the order of the LAMP.
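The two-stage sampling procedure above can be sketched in a few lines. This simulator is our own illustration, under the finite-support and masking assumptions just described:

```python
import numpy as np

def simulate_lamp(P, w, length, rng=None):
    """Simulate a LAMP: at each step, pick a lag q ~ w, then transition
    from the state q steps back using the Markov transition matrix P.
    Lags reaching before the start are masked to the initial state x_0."""
    rng = np.random.default_rng(rng)
    n = P.shape[0]
    lags = np.arange(1, len(w) + 1)
    states = [rng.integers(n)]            # arbitrary initial state
    for t in range(1, length):
        q = rng.choice(lags, p=w)         # step 1: choose the transition state
        source = states[max(0, t - q)]    # mask to x_0 if history is too short
        nxt = rng.choice(n, p=P[source])  # step 2: transition as in the chain
        states.append(nxt)
    return np.array(states)

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
w = np.array([0.5, 0.3, 0.2])  # kernel on lags {1, 2, 3}
path = simulate_lamp(P, w, 1000, rng=0)
```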

An alternative equivalent definition is provided in [7] as follows:

Definition 1 (LAMP Transitions). Given a stochastic matrix P and a distribution w on [k], the kth-order LAMP evolves according to the following transition probabilities:

P(X_n = x_n | X_{n−1} = x_{n−1}, …, X_1 = x_1) = Σ_{q=1}^{k} w_q P_{x_{n−q} x_n}.

In their defining paper Kumar et al. [7] provide a number of important results:

  1. There exists a LAMP of order k that cannot be approximated with any constant factor by a higher-order Markov process of order k − 1. So the model has more expressive power than a conventional Markov chain, with an exponentially smaller number of parameters.
  2. Under mild regularity conditions (i.e., ergodicity of the underlying Markov chain) LAMPs possess a limiting equilibrium distribution which is the same as that of the underlying simple Markov chain, i.e., the stationary distribution π.
  3. The choice of the kernel w determines how the history is included into the next state, and so can incorporate correlations extending as far back as desired, including long-range correlations (for k → ∞). It is not stated, but it is plausible that this is closely related to the convolution of the conventional autocorrelation of the underlying Markov chain and the LAMP kernel w. We show this in action in some later examples.
  4. Kumar et al. [7] provide a measure of the speed of convergence to the equilibrium distribution, though interestingly these results only apply where w has a finite fourth moment, and so these convergence estimates do not apply in the case of a long-range dependent process.

The authors also provide an approximate Maximum Likelihood Estimator (MLE) to learn a LAMP from data using alternating estimation of w and P.

Thus LAMPs are a simple and expressive model, characterised by a (first-order Markov) transition matrix and a discrete probability distribution on the positive integers. Their efficient parameterisation and simple theoretical results mean they can be easily learned and fit to data.

Kumar et al. [7] provide many useful results for their model but do not provide an explicit formula for the entropy rate of the process, which is the topic of this paper.

Entropy and entropy rate

The entropy rate of a sequence is a quantity that describes the self-information the sequence contains. It also represents an asymptotic lower bound on the lossless compression ratio of the sequence and can be used to measure how predictable a sequence is [9].

For a discrete random variable X, the Shannon entropy is denoted H, and is defined to be

H(X) = −Σ_x p(x) log p(x),

where p(⋅) is the probability distribution of X, and with the convention that 0 log 0 = 0. The joint entropy of n random variables X_1, …, X_n is the obvious extension:

H(X_1, …, X_n) = −Σ_{x_1, …, x_n} p(x_1, …, x_n) log p(x_1, …, x_n).

The entropy rate for a process is just the limiting average of the joint entropy,

H(χ) = lim_{n→∞} (1/n) H(X_1, X_2, …, X_n).

A standard result [9, Thm 4.2.1] links the entropy rate to an equivalent quantity defined in terms of conditional probabilities,

H′(χ) = lim_{n→∞} H(X_n | X_{n−1}, …, X_1).

For a stationary process the limits H and H′ exist and are equal [9, Thm 4.2.1], i.e., H(χ) = H′(χ). That is, for a stationary stochastic process, the conditional entropy rate and per-symbol entropy rate of n random variables are equal in the limit.

The result is convenient, particularly for computing the entropy rate of a stationary Markov chain, which is given by [9, Thm 4.2.4] as

H(χ) = −Σ_{i,j} π_i P_{ij} log P_{ij},

where P is the probability transition matrix, and π the stationary distribution. The primary result of this paper is that this formula also provides the entropy rate of a LAMP.
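A plug-in implementation of this formula is straightforward; the following sketch (our own, using the convention 0 log 0 = 0) computes the entropy rate in bits per symbol:

```python
import numpy as np

def markov_entropy_rate(P, base=2):
    """Entropy rate of a stationary Markov chain:
    H = -sum_ij pi_i P_ij log P_ij (bits/symbol for base 2)."""
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    pi = pi / pi.sum()
    # Apply the 0 log 0 = 0 convention for zero transition probabilities.
    logP = np.where(P > 0, np.log(np.maximum(P, 1e-300)), 0.0)
    return -np.sum(pi[:, None] * P * logP) / np.log(base)

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
H = markov_entropy_rate(P)  # by Theorem 1, this is also the entropy rate
                            # of any LAMP built on P, whatever the kernel w
```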

The entropy rate of a LAMP

Theorem 1 (Entropy rate of a LAMP). For a LAMP defined by an underlying stationary, first-order Markov Chain with transition matrix P, and kernel distribution w on [k], the entropy rate of the LAMP is

H(χ) = −Σ_{i,j} π_i P_{ij} log P_{ij},    (1)

where π is the stationary distribution of the Markov chain.

Proof. We begin by considering the entropy rate of n realisations from a LAMP, using known results from information theory about conditional entropy and the entropy rate of a first-order Markov Chain. As noted above, for a stationary process

H(χ) = lim_{n→∞} H(X_n | X_{n−1}, …, X_1).

Assume that at time n − 1 we choose X_{n−q} to be the transition state, selecting q according to distribution w. The state at time n depends only on X_{n−q}, and so the conditional entropy

H(X_n | X_{n−1}, …, X_1, q) = H(X_n | X_{n−q}),

where we include q explicitly in the conditioned term for notational convenience. A transition from state X_{n−q} → X_n in the LAMP is governed by the same probability transition matrix P as the transition X′_{n−1} → X′_n in the underlying Markov chain (we use X′ here to denote the underlying Markov chain). Furthermore, [7] shows that if the Markov chain corresponding to P is ergodic, then its stationary distribution π′ is also the stationary distribution of the LAMP, so we denote both by π. Hence, the conditional entropy is the entropy rate of the Markov chain, i.e.,

H(X_n | X_{n−q}) = −Σ_{i,j} π_i P_{ij} log P_{ij},

by [9, Thm 4.2.4]. Note that this decouples the entropy rate from the choice of q (assuming n is large enough that n − q ≥ 0 for all choices of q, or alternatively presuming an infinite history of the stationary process).

Now, we can remove the conditioning on q by noting that the overall conditional entropy is a probabilistically weighted sum over the choices of q, i.e.,

H(X_n | X_{n−1}, …, X_1) = Σ_{q=1}^{k} w_q H(X_n | X_{n−q}) = H(χ) Σ_{q=1}^{k} w_q,

where H(χ) is the entropy rate of the underlying first-order Markov Chain. Noting that w is a proper probability distribution, its sum is 1, and hence the conditional entropy is given by the constant term H(χ); the limit is therefore trivial and the result follows. □

The most noteworthy feature of this result is that the LAMP kernel w plays no part in the formula. This is surprising because the kernel clearly affects the autocorrelation structure of the process. Naive intuition suggests that a process with stronger autocorrelations should present less information at each time step, and hence have a smaller entropy rate.

While at first surprising, this result is intuitive: once the conditioning state X_{n−q} is selected, the process transitions in the same manner as a first-order Markov Chain. The randomness introduced by the choice of q balances the increased predictability provided by correlation with the history of the process.

Alternatively, we might think of this as increasing the correlations in the process, while also increasing the variance per time step, maintaining a constant entropy rate. However, we work here in the context of a finite alphabet of (not necessarily ordinal) states, so the common notions of variance have been subtly generalised by the LAMP model.

This useful result provides a model able to capture complex dependency structures in data while maintaining consistent randomness at a symbol-by-symbol level.

Model comparisons

To demonstrate the significance of the LAMP model, we will consider the dependency structure of five comparable models: a first-order Markov model, and four alternative LAMP models with differing kernels w (Fig 1). Each of these models shares the same transition matrix, illustrated in Fig 1(i). Since they share a transition matrix, these sequences all share the same stationary distribution and entropy rate of 0.45627 bits per symbol, so each model is superficially similar.

Fig 1. Visualisation of long range dependency within the LAMP model.

(i) A diagram of the underlying transition matrix for each of the models shown in (ii)–(vi). While these models all produce sample paths with differing dependency structures, they all have the same entropy rate, as proven in Theorem 1. In blue, the dependency structure of each model is visualised using Cramér's V for lags up to 40. This provides a measure of correlation for symbolic variables and is analogous to considering a lagged cross-correlation function. In (ii), the dependency structure is shown for a first-order Markov model. In (iii)–(vi) we see that the relative frequency w_k, i.e., the distribution of the size of the backward step, directly influences the shape of the dependency structure. While Cramér's V is calculated for a lag of 0 for the first-order Markov model, it is excluded from plots (iii)–(vi) since the value is much larger due to the Markov Chain's memoryless property; excluding it allows the similarity between the frequency distribution of step size and the dependency structure to be seen more easily.

https://doi.org/10.1371/journal.pone.0295074.g001

The dependency structure is measured by calculating Cramér’s V statistic for various lags. Based on Pearson’s Chi-Squared test, the Cramér’s V statistic measures the association between two discrete, nominal variables. To measure the self association for a sequence with a given lag l, pairs of values from the sequence at index i and i + l are used to construct a contingency table, which is used to calculate Cramér’s V statistic. The statistic ranges from 0 to 1, where 0 indicates no correlation, and 1 indicates a perfect relationship between the pairs of variables. Calculating this value for various values of l produces a measure of dependency structure comparable to a lagged cross-correlation plot, but without assigning ordinal value to the state labels.
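A possible implementation of this lagged Cramér's V calculation (our own sketch; no bias correction is applied) is:

```python
import numpy as np

def cramers_v(x, y):
    """Cramer's V between two equal-length symbolic sequences."""
    xs, x_idx = np.unique(x, return_inverse=True)
    ys, y_idx = np.unique(y, return_inverse=True)
    table = np.zeros((len(xs), len(ys)))      # contingency table of pairs
    np.add.at(table, (x_idx, y_idx), 1)
    n = table.sum()
    expected = np.outer(table.sum(1), table.sum(0)) / n
    chi2 = np.sum((table - expected) ** 2 / expected)
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k)) if k > 0 else 0.0

def lagged_cramers_v(seq, max_lag):
    """Self-association at lags 1..max_lag, analogous to an ACF plot."""
    seq = np.asarray(seq)
    return [cramers_v(seq[:-l], seq[l:]) for l in range(1, max_lag + 1)]
```

For a perfectly periodic sequence such as 0, 1, 0, 1, …, the statistic is 1 at every lag, since each pair of values is in a deterministic relationship.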

The autocorrelation structure for a standard first-order Markov model is shown in Fig 1(ii), along with a short sample path. Fig 1(iii)–1(vi) show the Cramér's V statistic for the LAMP models in comparison to the choice of w. Note that lag = 0 is not plotted in these cases, as this value will always be 1.

It is evident that the dependency structure mirrors the shape of w_k, demonstrating that the LAMP framework provides a useful model for generating sequences with a desired dependency structure. It is important to note that this structure repeats and diminishes at integer multiples of the original lag, and also includes structure inherited from the original Markov model, so the two are not identical. Nonetheless, there is an evident ability to tune the autocorrelation of a LAMP while preserving other properties such as the stationary distribution and entropy rate.

Empirical evaluation

Following the same approach as the original LAMP paper [7], we fit and estimate Shannon Entropy rates on four publicly available datasets, Lastfm, BrightKite, Wikispeedia and Reuters. Each dataset is comprised of a number of distinct sequences which represent the activity of an individual or a single process. We compare the entropy rate estimates of (1) an empirical estimator, (2) the estimator obtained by fitting a Markov chain, and (3) the entropy estimate obtained by fitting a LAMP.

Our code is available at github.com/bridget-smart/LAMPEntropyEstimates and the links to each dataset are provided in the text below.

The Lastfm dataset contains the user activity from the music streaming service last.fm for 992 users. Users can choose to listen to stations based on a genre or artist, a particular song, and can share their listening activity. The dataset contains the user, timestamp, artist and song, but here we construct sequences using only the user and artist. Each sequence contains data for a single user, with items representing different artists. Containing 19M items, the data is available at http://ocelma.net/MusicRecommendationDataset/lastfm-1K.html.

The BrightKite dataset contains check-in data from the location-based social networking service BrightKite, where users share their checked-in locations. Each sequence represents a single user, with items representing the locations which the user has checked into. 772,967 locations and 51,406 users are represented in the dataset. The data is available at https://snap.stanford.edu/data/loc-brightkite.html.

The Wikispeedia navigation path dataset is comprised of sequences representing a path along Wikipedia links from a condensed version of Wikipedia. These sequences are collected using the online game Wikispeedia, where users race to navigate between articles. This dataset contains 51,318 complete paths. The dataset is available at https://snap.stanford.edu/data/wikispeedia.html.

For the Reuters dataset we use the Reuters-21578, Distribution 1.0 benchmark corpus, containing newswire articles, with each sequence representing a single article and items representing distinct words. This dataset is available at www.nltk.org/nltk_data.

Following the same process as in [7], consecutive repeated values within each path are removed to prevent self-loops from appearing in the transition matrices. Consecutive repeated values are also not meaningful in our datasets, as they do not contain temporal data. We also replace values which appear fewer than 10 times (fewer than 50 times for the Lastfm dataset) with a unique token, representing a less frequently visited state. This preprocessing ensured our model fit was consistent with the original paper, and across both the LAMP and first-order Markov models.
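A sketch of this preprocessing is given below; it is our own illustration, and the name of the shared rare-item token is hypothetical:

```python
from collections import Counter

RARE_TOKEN = "<rare>"  # hypothetical name for the shared rare-item token

def preprocess(sequences, min_count=10):
    """Collapse consecutive repeats (no self-loops) and replace items
    appearing fewer than min_count times with a single shared token."""
    # Remove consecutive duplicates within each path.
    deduped = []
    for seq in sequences:
        if not seq:
            deduped.append([])
            continue
        out = [seq[0]]
        for item in seq[1:]:
            if item != out[-1]:
                out.append(item)
        deduped.append(out)
    # Replace rare items, counted across the whole dataset.
    counts = Counter(item for seq in deduped for item in seq)
    return [[item if counts[item] >= min_count else RARE_TOKEN
             for item in seq] for seq in deduped]
```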

LAMP models are fit to each dataset using the original LAMP code available at github.com/google-research/google-research/tree/master/lamp. Modified code used to obtain the LAMP transition matrices for this paper is available at github.com/bridget-smart/modified_lamp. The code used to obtain the entropy estimates is available at github.com/bridget-smart/LAMPEntropyEstimates.

While a modified version of the authors' original code was used to fit the LAMP model, care was taken to exactly reproduce the preprocessing steps outlined in Kumar et al. [7] when fitting the remaining models. The code from Kumar et al. [7] fits the LAMP model using a subset of the data and a cross-validation technique to overcome computational and time constraints. Therefore, the estimate using the LAMP model was fit on a smaller subset of the data than the other estimates. Discrepancies in the transition matrix size between the first-order Markov and LAMP models indicate the differences caused by this.

Stationary distribution derivations are valid where the transition matrix is irreducible, but the data is imperfect, and this condition does not hold for all transition matrices estimated from these datasets. We overcome this in two ways: (1) by only considering the largest strongly-connected component of the graph generated by treating the transition matrix as a directed network; and (2) by artificially connecting each distinct communicating class to the first state with transition weight p_artificial.

The first approach assumes that the largest strongly connected component of the transition matrix is representative of the chain's behaviour. To assess this we consider how many nodes are excluded from that component. For the first-order Markov models, the numbers of nodes ignored when only considering the largest strongly connected component were 1, 1255, 68 and 2 for the four datasets Lastfm, BrightKite, Wikispeedia and Reuters respectively, out of totals of 23870, 43596, 3425 and 8465 nodes. The BrightKite dataset has the largest proportion of nodes removed (0.02879); even this proportion is small, which suggests the component is representative.
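The restriction to the largest strongly connected component can be sketched as follows (our own illustration, using a boolean transitive closure rather than a graph library):

```python
import numpy as np

def restrict_to_largest_scc(P):
    """Keep only the largest strongly connected component of the chain's
    transition graph, then renormalise rows so the result is stochastic."""
    n = P.shape[0]
    # Transitive closure of the adjacency matrix by repeated squaring.
    R = (P > 0) | np.eye(n, dtype=bool)
    for _ in range(max(1, int(np.ceil(np.log2(n))))):
        R = (R.astype(int) @ R.astype(int)) > 0
    mutual = R & R.T                 # mutual[i, j]: i and j reach each other
    keep = np.flatnonzero(mutual[np.argmax(mutual.sum(axis=1))])
    Q = P[np.ix_(keep, keep)].astype(float)
    return Q / Q.sum(axis=1, keepdims=True), keep
```

Here each row of `mutual` is the membership vector of that node's strongly connected component, and the row with the most members identifies the largest component.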

In the LAMP models, 237 (of 17766 for Lastfm), 2591 (of 25348 for BrightKite), 2775 (of 3504 for Wikispeedia) and 3094 (of 6822 for Reuters) nodes were ignored when only considering the largest strongly connected component to obtain the entropy estimates. These proportions, ranging from 0.01334 to 0.7920, are generally higher than those for the first-order Markov model, which may result in more instability within the estimated entropy values.

The second approach forces irreducibility by ensuring that the chain mixes and all states can be visited. This approach, and in particular a discussion of how to choose p_artificial, is given in the Supporting information; a brief description follows. The approach assumes that it is possible to communicate between each pair of states, which is reasonable for most of our datasets: for instance, it is possible for a user to listen to any pair of artists, but rare transitions are unlikely to be observed, so including a small transition probability to connect the network is reasonable. Artificially inducing transitions between states which are otherwise disconnected modifies the behaviour of the chain. It follows that these artificial links should have a small weight, since the corresponding transitions were not observed in the data sample and are therefore presumably rare; however, the weight must be large enough that mixing can occur, stable estimates can be calculated, and errors introduced by computer precision are avoided. The choice of p_artificial = 2^-15 and p_artificial = 2^-10 is discussed in the Supporting information.
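One simple way to realise this kind of artificial connection (our own sketch, and one of several possible constructions: every state receives a small transition weight to the first state and vice versa, which guarantees irreducibility via state 0) is:

```python
import numpy as np

def force_irreducible(P, p_art=2.0 ** -15):
    """Mix in a small artificial transition from each state to the first
    state, and from the first state to every state, then renormalise rows.
    Every state then communicates with every other via state 0."""
    Q = P.astype(float).copy()
    n = Q.shape[0]
    Q[:, 0] += p_art        # every state can now reach state 0
    Q[0, :] += p_art / n    # state 0 can now reach every state
    return Q / Q.sum(axis=1, keepdims=True)
```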

The results are shown in Table 1. The estimators referred to in the table are defined in detail below. At a high level:

  • Shannon Empirical estimators use Ĥ = −Σ_x p̂(x) log p̂(x), where p̂ is an estimated probability distribution. We consider three cases, defined below.
  • Markov estimates use the standard Markov-chain entropy given in Eq 1, with empirical estimates of the transition and stationary probabilities. They do so either on the Largest Connected Component (CC) or on a chain with artificially induced irreducibility as described earlier.
  • LAMP estimates use the LAMP entropy, which is also given in Eq 1, with empirical estimates of the transition and stationary probabilities; note that the transition matrices for LAMP models will differ from those of the Markov case. These estimators also operate either on the largest connected component (CC) or on a chain with artificially induced irreducibility, as described earlier.
Table 1. Entropy estimates for each of the four datasets using seven estimation methods.

Two main approaches are used to ensure transition matrices are ergodic: only considering the largest connected component; and adding an artificial state with transition weight p_artificial. Both of these approaches give similar results for each of the four datasets.

https://doi.org/10.1371/journal.pone.0295074.t001

Shannon estimators are applied to a probability distribution, but the structure of the data provides several alternative distributions to consider.

The data used contains multiple sample paths, i.e., we have a set of n subsequences s_1, …, s_n, where subsequence s_i = (s_{i1}, …, s_{im_i}) is composed of m_i items, for a total of N realisations across all subsequences.

The Sequence Level estimator uses the frequency of each item across all subsequences to estimate its probability of occurrence, i.e., p̂(x) = (1/N) Σ_{i=1}^{N} 1(x = y_i), where y_1, …, y_N are the items pooled across all subsequences and 1(x = y_i) is an indicator variable which takes the value 1 if x = y_i.

The Path Level estimator uses the frequency of each symbol in a sequence to estimate a simple Shannon entropy for each sequence, and then averages over these, i.e., Ĥ = (1/n) Σ_{i=1}^{n} Ĥ_i, where Ĥ_i = −Σ_x p̂_i(x) log p̂_i(x) and p̂_i(x) = (1/m_i) Σ_{j=1}^{m_i} 1(x = s_{ij}), with 1(x = s_{ij}) an indicator variable which takes the value 1 if x = s_{ij}.

The Stationary Distribution estimator uses as its probability distribution the stationary distribution as estimated through one of the Markov models.
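The sequence-level and path-level estimators can be sketched as simple plug-in estimators (our own illustration, reporting bits):

```python
import numpy as np
from collections import Counter

def shannon(counts):
    """Plug-in Shannon entropy (bits) from a Counter of symbol counts."""
    total = sum(counts.values())
    p = np.array([c / total for c in counts.values()])
    return float(-np.sum(p * np.log2(p)))

def sequence_level(paths):
    """Pool symbol frequencies across all subsequences."""
    return shannon(Counter(x for path in paths for x in path))

def path_level(paths):
    """Average of per-subsequence plug-in entropies."""
    return float(np.mean([shannon(Counter(path)) for path in paths]))
```

For example, two paths ["a", "b"] and ["a", "b"] give 1 bit under both estimators, while ["a", "a"] and ["b", "b"] give 1 bit at the sequence level but 0 bits at the path level, since each individual path is constant.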

The results shown in Table 1 provide several insights:

  1. The Shannon sequence-level estimator grossly overestimates entropy values for all cases. The most likely scenario is that there just are not enough sampled sequences to properly represent the underlying distribution, so the calculation is essentially computing the entropy of a distribution close to the uniform distribution, i.e., near the maximum possible value.
  2. The Shannon entropy of the stationary distribution is also a fair over-estimate indicating that the transition structure is important in these sequences.
  3. The Shannon path-level estimator produces some overestimates, but also a very low value for the BrightKite data. This anomalous result may arise for several reasons, but it is noteworthy that the Shannon estimator can be biased on short data, and as it is applied here to each subsequence and then averaged, we should expect bias in the overall estimate. At best it appears to be a sensitive and hard-to-interpret measure.
  4. There is close agreement between entropy estimators using the largest connected component and those using induced irreducibility. This similarity is most pronounced for the LAMP models. It suggests that the manner in which we corrected for reducible processes does not matter much, and that the LAMP models are more robust to such details.
  5. Lower estimated entropies are often indicative of a better model (for the purpose of estimating entropy), one that captures more of the predictability of the sequence. The estimates generated using the LAMP models are the lowest for the Wikispeedia and Reuters datasets. These lower entropy estimates imply that the LAMP model captures more information about the behaviour of the system than the underlying first-order Markov model, making the sequence more predictable; by better capturing the long-term dependency structures in the data, we achieve a more meaningful model. The estimates obtained using the LAMP model appear appropriate for all datasets. This supports the claims of [7] that the LAMP model provides a better, succinct representation of these datasets, though our claim is slightly stronger in the context of entropy because they compare against k-order and adaptive Markov chains.
  6. For the Lastfm and BrightKite datasets, the estimates obtained using the first-order Markov models are lower than those obtained with the LAMP model. These two datasets represent artist sequences on a music streaming service and checked-in user locations, and it is possible that these sequences can be accurately captured without the LAMP structure. This is somewhat in contrast with [7], but note that this finding concerns only the entropy estimates; there are other facets of a model that are also important.

Conclusion

By deriving a theoretical result for the entropy rate of a linear additive Markov Process, we extend the versatility of LAMPs to information-centic applications and derive a surprising equality between the entropy rate of a LAMP and the underlying first-order Markov Chain. A consequence of this is that realisations of processes having different long-range dependency structures—and sample paths which are therefore quite different from each other—might nonetheless be indistinguishable based on entropy rates. This is because the entropy rates depend only on the underlying transition matrix structure, and not the long-range dependence structure. The LAMP however provides a means to distinguish between such processes, by visualisation of the long-range dependency structure. We demonstrate this via numerical simulations and use of the Cramer’s V statistic. While LAMPs exhibit long-term dependency structures, their symbol-by-symbol dependency is low due to the randomness introduced by the distribution over the choice of conditioning state. Linear Additive Markov Processes present an intuitive yet useful extension of traditional Markov chain models to incorporate long-range dependence. We demonstrate how the LAMP model can be used to obtain entropy estimates on four datasets, compare the estimates and explore two methods for ensuring ergodicity of the transition matrix. This work demonstrates that LAMPs can be understood within a traditional information-theoretic framework, and many further results are possible in this context. Deriving an expression for the entropy of a LAMP model reveals new and exciting applications to settings with well-known or observable long-range dependency structures, provides a modelling option where higher-order dynamics are necessary but there is highly-correlated or limited data, and provides the opportunity to integrate the LAMP model into information-theoretic estimation tools. 
Future work will explore these directions, including the application of LAMP models to understanding long-range phenomena such as social media communications [5], enhancing network traffic measurements such as [3], deriving online estimators for entropy such as in [4], and serving as an alternative to the context tree weighting based (CTW-based) entropy estimator [10].

Supporting information

S1 Fig. Convergence for the entropy estimates on various datasets.

Plots showing convergence of the normalised entropy estimate against log2 of the artificial link weight p_artificial, which was used to ensure ergodicity. (i) shows hyperparameter sensitivity for the first-order Markov models and (ii) for the LAMP models. Each plot shows the effect of the artificial link weight on both the first-order Markov model estimate and the LAMP model estimate for each dataset. We aim to find a region where the estimates are insensitive to the choice of hyperparameter. A dashed black line on each plot indicates the value when the artificial link weight is 2^-15. This value was chosen as a global value, since it is a reasonable choice for all dataset-model combinations, apart from the first-order Markov model on the BrightKite dataset, for which a value of 2^-10 was used to obtain the final estimate. This alternative value is indicated by a grey dashed line. The convergence curves for the WIKISPEEDIA and REUTERS datasets overlapped for the first-order Markov model, so the values for the REUTERS dataset were offset by +0.04 for visualisation. Small vertical offsets were also added to the REUTERS and BRIGHTKITE datasets in the LAMP model visualisation (+0.01 and -0.005, respectively).

https://doi.org/10.1371/journal.pone.0295074.s001

(PDF)

References

  1. Leland W, Taqqu M, Willinger W, Wilson D. On the Self-Similar Nature of Ethernet Traffic (Extended Version). IEEE/ACM Transactions on Networking. 1994;2(1):1–15.
  2. Beran J. Statistics for Long-Memory Processes. Chapman and Hall, New York; 1994.
  3. Nguyen H, Roughan M. Rigorous Statistical Analysis of Internet Loss Measurements. IEEE/ACM Transactions on Networking. 2013;21(3):734–745.
  4. Roughan M, Veitch D, Abry P. On-line estimation of the parameters of long-range dependence. In: IEEE GLOBECOM. vol. 6; 1998. p. 3716–3721.
  5. Mathews P, Mitchell L, Nguyen G, Bean N. The nature and origin of heavy tails in retweet activity. In: Proceedings of the 26th International Conference on World Wide Web Companion; 2017. p. 1493–1498.
  6. Carpio K, Daley D. Long-range dependence of Markov chains in discrete time on countable state space. Journal of Applied Probability. 2007;44(4):1047–1055.
  7. Kumar R, Raghu M, Sarlós T, Tomkins A. Linear Additive Markov Processes. In: Proceedings of the 26th International Conference on World Wide Web. WWW'17. Republic and Canton of Geneva, CHE: International World Wide Web Conferences Steering Committee; 2017. p. 411–419. Available from: https://doi.org/10.1145/3038912.3052644.
  8. Raftery AE. A Model for High-Order Markov Chains. Journal of the Royal Statistical Society Series B (Methodological). 1985;47(3):528–539.
  9. Cover T, Thomas J. Elements of information theory. Wiley series in telecommunications. New York: Wiley; 1991.
  10. Willems FM, Shtarkov YM, Tjalkens TJ. The context-tree weighting method: Basic properties. IEEE Transactions on Information Theory. 1995;41(3):653–664.