The entropy rate of Linear Additive Markov Processes

This work derives a theoretical value for the entropy rate of a Linear Additive Markov Process (LAMP), an expressive but simple model able to generate sequences with a given autocorrelation structure. Our research establishes that the theoretical entropy rate of a LAMP model is equivalent to the theoretical entropy rate of the underlying first-order Markov Chain. The LAMP model captures complex relationships and long-range dependencies in data with similar expressibility to a higher-order Markov process. While a higher-order Markov process has a polynomial parameter space, a LAMP model is characterised only by a probability distribution and the transition matrix of an underlying first-order Markov Chain. This surprising result can be explained by the information balance between the additional structure imposed by the next state distribution of the LAMP model, and the additional randomness of each new transition. Understanding the entropy of the LAMP model provides a tool to model complex dependencies in data while retaining useful theoretical results. To emphasise the practical applications, we use the LAMP model to estimate the entropy rate of the LastFM, BrightKite, Wikispeedia and Reuters-21578 datasets. We compare estimates calculated using frequency probability estimates, a first-order Markov model and the LAMP model, also considering two approaches to ensure the transition matrix is irreducible. In most cases the LAMP entropy rates are lower than those of the alternatives, suggesting that the LAMP model is better at accommodating structural dependencies in the processes, achieving a more accurate estimate of the true entropy.


II. INTRODUCTION
Markov processes are a simple model with wide-ranging applications. They rely on the Markov property, where the next state is only dependent on the current state, and is conditionally independent of the history of the process. The model's simplicity allows theoretical results to be easily derived and understood, and the flexibility of the model's framework makes it a useful tool which can be easily extended, and has consequently been used in a vast number of applications.
B. Smart, M. Roughan and L. Mitchell are with the University of Adelaide. Email: {bridget.smart,matthew.roughan,lewis.mitchell}@adelaide.edu.au. B. Smart would like to acknowledge the support of a Westpac Future Leaders Scholarship. M. Roughan and L. Mitchell are supported by the Australian Government through the Australian Research Council's Discovery Projects funding scheme (project DP210103700).
However, many real sequences have correlations with a longer or more complex history than the simple Markov property allows [LTWW94], [Ber94]. It is not uncommon to find correlations in real data that are so strong that the process is considered to have long-range dependence, which is defined by the tail of the autocorrelation function decaying so slowly that sums over the tail diverge. Other processes with only short-range correlations may still exhibit complex structure of interest.
Markov processes do have a correlation structure, and this can even be long-range dependent for infinite-state Markov chains [CD07], but the autocorrelation structure of a Markov process is difficult to tune, reducing their usefulness in applications that require control over the autocorrelation structure.
Higher-order Markov processes allow the next transition to depend on the previous m states, where m describes the order of the process, and thus allow longer correlations to be built into the process. For an mth-order Markov process with n possible states, modelling these transitions requires $(n-1)n^m$ parameters, so higher-order models are prone to overfitting as m increases.
The Linear Additive Markov Process (LAMP), proposed by Kumar et al. [KMST17], overcomes these challenges. The LAMP can fit a measured autocorrelation structure even for sequences that exhibit long-range dependence. An expressive model, the LAMP has been shown to have comparable performance to deep sequential models despite its relatively small parameter space [KMST17]. While LAMPs are not as expressive as higher-order Markov processes, their straightforward theoretical results and explainability make them a good model for time-series data where the measured correlation structure is of interest.
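To make the difference in parameter counts concrete, the following minimal sketch compares the $(n-1)n^m$ count quoted above with a count for a LAMP. The LAMP count is our own illustrative assumption (free entries of an n × n stochastic matrix plus the free entries of a kernel on [k]), and the function names are hypothetical.

```python
def markov_param_count(n: int, m: int) -> int:
    """Free parameters of an m-th order Markov chain on n states: (n - 1) * n**m."""
    return (n - 1) * n ** m


def lamp_param_count(n: int, k: int) -> int:
    """Assumed count for an order-k LAMP: n rows with n - 1 free entries each,
    plus k - 1 free entries for the kernel w on {1, ..., k}."""
    return (n - 1) * n + (k - 1)


if __name__ == "__main__":
    # Compare the two counts for a 10-state alphabet at a few orders.
    for order in (1, 2, 3, 5):
        print(order, markov_param_count(10, order), lamp_param_count(10, order))
```

Under these assumptions the higher-order count grows geometrically in the order, while the LAMP count grows only linearly.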
This work proves that the entropy rate of a LAMP is equal to that of the underlying first-order Markov Chain. The result is surprising because the structure introduced by the LAMP plays no part in the entropy rate, despite having a strong impact on the structure of the process, for instance through the autocorrelation. A consequence of this result is that the LAMP is able to distinguish between sequences that might look very different due to the presence of long-range dependency (LRD), yet have the same entropy rate, providing a further useful application of the model.
In summary, the contributions of the paper are:
1) A closed-form solution giving the entropy rate of a LAMP, showing that the entropy rate is exactly that of the underlying Markov chain regardless of the autocorrelation.
2) Use of this result to construct an estimator, which is applied to four publicly available datasets to obtain entropy estimates close to or lower than values obtained using a first-order Markov model. Estimated entropy values are 4.88 bits/symbol for the LastFM dataset, 2.49 bits/symbol for the BrightKite dataset, 3.18 bits/symbol for the Wikispeedia dataset and 3.61 bits/symbol for the Reuters dataset.
3) An exploration of the impacts of forcing ergodicity in a transition matrix on entropy estimates. This work considers two approaches, using only the largest connected component and adding an artificial fully connected node, with entropy estimates similar between the two approaches. The stability of the estimate against the artificial transition probability was analysed, and a robust value of $p_{\text{artificial}} = 2^{-15}$ is recommended.

III. BACKGROUND
We start by defining a standard discrete-time, time-homogeneous Markov chain $X_t$, a process which underlies the LAMP model. The definitions and notation are standard, but we include them to ensure there is no confusion.
Let S be a state space, which contains n states. Without loss of generality, we label the states S = [n] = {1, 2, ..., n}. We take P to be the n × n stochastic matrix which defines the transitions between states for the underlying first-order Markov process. It follows that the row sums of P are all equal to 1, and each element is non-negative.
In a first-order Markov process, the probability of transitioning from state $x_{t-1}$ to $x_t$ is given by
$$\Pr(X_t = x_t \mid X_{t-1} = x_{t-1}, \ldots, X_0 = x_0) = \Pr(X_t = x_t \mid X_{t-1} = x_{t-1}) = P_{x_{t-1}, x_t},$$
by the Markov property. The stationary distribution of the Markov process is denoted by the vector π, which is the left eigenvector of P corresponding to the eigenvalue 1. That is, π satisfies the equation πP = π.
If P defines an ergodic Markov process, this stationary distribution is unique.
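As an illustration of the relation πP = π, a minimal sketch of computing π by power iteration is given below. The function name, tolerance and the example matrix are our own assumptions for illustration; any eigenvector routine would serve equally well.

```python
def stationary_distribution(P, tol=1e-12, max_iter=100_000):
    """Approximate pi satisfying pi P = pi by repeated left-multiplication.

    Assumes P is a row-stochastic list of lists describing an ergodic chain,
    so that the iteration converges to the unique stationary distribution.
    """
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        new = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi


# Hypothetical two-state example; its exact stationary distribution is (5/6, 1/6).
P_example = [[0.9, 0.1],
             [0.5, 0.5]]
pi_example = stationary_distribution(P_example)
```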
The above results can be generalised to countable state processes; however, a finite-state Markov chain cannot reproduce long-range dependency structures. LAMPs afford control over the correlations within the process without requiring an infinite state space.

A. Linear Additive Markov Process (LAMP)
A Linear Additive Markov Process (LAMP) is defined using a first-order Markov process and a probability distribution w on the positive integers that we refer to as the LAMP kernel. Assuming we are at time $n-1$ and wish to calculate the next state $X_n$, we adopt the following two-stage process:
1) Select a past state $x_{n-q}$ with probability $w_q$. This state is also referred to as the transition state, $X_{T_n}$.
2) Determine the state $X_n = x_n$ according to the probabilities $P_{x_{n-q}, x_n}$.
That is, we use the same state transition probabilities as in the underlying Markov chain, but the historical state on which the transition is based is randomly chosen according to w.
In practice, there are some extra details: (i) the distribution w is typically assumed to have finite support [k] = {1, 2, ..., k} but it need not be finite, and (ii) the state that is chosen must be from the existing history, which does not extend k steps initially, and so the state chosen as the jumping off point is masked according to $x_{\max\{0, n-q\}}$. The value of k gives the order of the LAMP.
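The two-stage procedure, including the masking of the early history, can be sketched as follows. This is a minimal illustration rather than the reference implementation of [KMST17]; the function name, the 0-indexed state labels and the list-based kernel are our own assumptions.

```python
import random


def simulate_lamp(P, w, length, x0=0, seed=None):
    """Sample a path from a LAMP with transition matrix P and kernel w.

    P is a row-stochastic list of lists over states 0..n-1 (0-indexed here,
    unlike the 1-indexed labels in the text). w[q - 1] is the probability of
    stepping back q places, for q = 1..k.
    """
    rng = random.Random(seed)
    states = range(len(P))
    lags = range(1, len(w) + 1)
    path = [x0]
    for n in range(1, length):
        # Stage 1: choose the lag q ~ w, clamping to the start of the history.
        q = rng.choices(lags, weights=w)[0]
        source = path[max(0, n - q)]
        # Stage 2: transition from the chosen past state using row P[source].
        path.append(rng.choices(states, weights=P[source])[0])
    return path
```

With the kernel w = [1.0] the sketch reduces to sampling from the underlying first-order Markov chain.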
An alternative equivalent definition is provided in [KMST17] as follows:
Definition III.1 (LAMP Transitions). Given a stochastic matrix P and a distribution w on [k], the kth-order LAMP evolves according to the following transition probabilities:
$$\Pr(X_n = x_n \mid x_{n-1}, \ldots, x_0) = \sum_{q=1}^{k} w_q P_{x_{\max\{0, n-q\}}, x_n}.$$
In their defining paper Kumar et al. [KMST17] provide a number of important results:
1) There exists a LAMP of order k that cannot be approximated to within any constant factor by a higher order Markov process of order m − 1. So the model has more expressive power than a conventional Markov chain, with an exponentially smaller number of parameters.
2) Under mild regularity conditions (i.e., ergodicity of the underlying Markov chain) LAMPs possess a limiting equilibrium distribution which is the same as that of the underlying simple Markov chain, i.e., the stationary distribution π.
3) The choice of the kernel w determines how the history is included into the next state, and so can incorporate correlations extending as far back as desired, including long-range correlations (for k → ∞). It is not stated, but it is plausible that this is closely related to the convolution of the conventional autocorrelation of the underlying Markov chain and the LAMP kernel w. We show this in action in some later examples.
4) Kumar et al. [KMST17] provide a measure of the speed of convergence to the equilibrium distribution, though interestingly these results only apply where w has a finite fourth moment, and so these convergence estimates do not apply in the case of a long-range dependent process.
The authors also provide an approximate Maximum Likelihood Estimator (MLE) to learn a LAMP from data using alternating estimation of w and P.
Thus LAMPs are a simple and expressive model, characterised by a (first-order Markov) transition matrix and a discrete probability distribution on the positive integers. Their efficient parameterisation and simple theoretical results mean they can be easily learned and fit to data.
Kumar et al. [KMST17] provide many useful results for their model but do not provide an explicit formula for the entropy rate of the process, which is the topic of this paper.

B. Entropy and Entropy Rate
The entropy rate of a sequence is a quantity that describes the self-information the sequence contains. It also represents an asymptotic lower bound on the lossless compression ratio of the sequence and can be used to measure how predictable a sequence is [CT91].
For a discrete random variable, X, the Shannon entropy is denoted H, and is defined to be
$$H(X) = -\sum_{x} p(x) \log p(x),$$
where p(·) is the probability distribution of X, and with the convention that 0 log 0 = 0. The joint entropy of n random variables $X_i$ is the obvious extension:
$$H(X_1, \ldots, X_n) = -\sum_{x_1, \ldots, x_n} p(x_1, \ldots, x_n) \log p(x_1, \ldots, x_n).$$
The entropy rate for a process $\mathcal{X}$ is just the limiting average of the joint entropy,
$$H(\mathcal{X}) = \lim_{n \to \infty} \frac{1}{n} H(X_1, \ldots, X_n).$$
A standard result [CT91, Thm 4.2.1] links the entropy rate to an equivalent defined in terms of conditional probabilities,
$$H'(\mathcal{X}) = \lim_{n \to \infty} H(X_n \mid X_{n-1}, \ldots, X_1).$$
For a stationary process the limits $H$ and $H'$ exist and are equal [CT91, Thm 4.2.1], i.e., $H(\mathcal{X}) = H'(\mathcal{X})$. That is, for a stationary stochastic process, the conditional entropy rate and per symbol entropy rate of n random variables are equal in the limit.
The result is convenient, particularly for computing the entropy rate of a stationary Markov chain, which is given by [CT91, Thm 4.2.4] as
$$H(\mathcal{X}) = -\sum_{i,j} \pi_i P_{ij} \log P_{ij},$$
where P is the probability transition matrix, and π the stationary distribution. The primary result of this paper is that this formula also provides the entropy rate of a LAMP.
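The Markov chain entropy rate formula can be evaluated directly, as in the minimal sketch below. The function name and the example matrix are hypothetical; the example π = (5/6, 1/6) is the exact stationary distribution of that matrix, and logarithms are taken base 2 so the result is in bits per symbol.

```python
import math


def entropy_rate(P, pi):
    """Entropy rate -sum_i pi_i sum_j P_ij log2 P_ij, in bits per symbol.

    Zero entries of P are skipped, implementing the convention 0 log 0 = 0.
    """
    return -sum(
        pi[i] * p * math.log2(p)
        for i, row in enumerate(P)
        for p in row
        if p > 0
    )


# Hypothetical two-state chain with stationary distribution (5/6, 1/6).
P_example = [[0.9, 0.1],
             [0.5, 0.5]]
pi_example = [5 / 6, 1 / 6]
rate_example = entropy_rate(P_example, pi_example)
```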

IV. THE ENTROPY RATE OF A LAMP
Theorem IV.1 (Entropy rate of a LAMP). For a LAMP defined by an underlying stationary, first-order Markov Chain with transition matrix P, and kernel distribution w on [k], the entropy rate of the LAMP is
$$-\sum_{i,j} \pi_i P_{ij} \log P_{ij},$$
where π is the stationary distribution of the Markov chain.
Proof. We begin by considering the entropy rate of n realisations from a LAMP, using known results from information theory about conditional entropy and the entropy rate of a first-order Markov Chain. As noted above, for a stationary Markov chain $H(\mathcal{X}) = H'(\mathcal{X})$.
Assume that at time $n-1$ we choose $X_{T_n} = X_{n-q}$ to be the transition state according to distribution w. The state at time n depends only on $X_{n-q}$ and so the conditional entropy is
$$H(X_n \mid X_{n-1}, \ldots, X_0, q) = H(X_n \mid X_{n-q}),$$
where we include q explicitly in the conditioned term for notational convenience. A transition from state $X_{n-q} \to X_n$ in the LAMP is governed by the same probability transition matrix P as the transition $X_{n-1} \to X_n$ in the underlying Markov chain (we use $\mathcal{X}$ here to denote the underlying Markov chain). Furthermore [KMST17] shows that if the Markov chain corresponding to P is ergodic, then its stationary distribution π is also the stationary distribution of the LAMP, so we denote both by π. Hence, the conditional entropy is the entropy rate of the Markov chain, i.e.,
$$H(X_n \mid X_{n-q}) = -\sum_{i,j} \pi_i P_{ij} \log P_{ij} = H(\mathcal{X}),$$
by [CT91, Thm 4.2.4]. Note that this decouples the entropy rate from the choice of q (assuming n is large enough that $n - q \geq 0$ for all choices of q, or alternatively presuming an infinite history of the stationary process). Now, we can remove the conditioning on q by noting that the overall conditional entropy is a probabilistically weighted sum over the choices of q, i.e.,
$$H(X_n \mid X_{n-1}, \ldots, X_0) = \sum_{q=1}^{k} w_q H(X_n \mid X_{n-1}, \ldots, X_0, q) = H(\mathcal{X}) \sum_{q=1}^{k} w_q.$$
Noting that w is a proper probability distribution, its sum is 1, and hence the conditional entropy is given by the constant term, and hence the limit is trivial and the result follows: the entropy rate of the LAMP is $H(\mathcal{X})$, where $H(\mathcal{X})$ is the entropy rate of the underlying first-order Markov Chain.
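As a worked illustration of the result (our own example, with hypothetical parameters $a, b \in (0, 1)$), consider a two-state underlying chain. The stationary distribution and entropy rate below follow from πP = π and [CT91, Thm 4.2.4], and by the theorem the same value holds for any LAMP built on this chain, whatever the kernel w.

```latex
P = \begin{pmatrix} 1-a & a \\ b & 1-b \end{pmatrix}, \qquad
\pi = \left( \frac{b}{a+b}, \; \frac{a}{a+b} \right),
\qquad
-\sum_{i,j} \pi_i P_{ij} \log P_{ij} = \frac{b\, h(a) + a\, h(b)}{a+b},
\qquad
h(p) = -p \log p - (1-p) \log (1-p).
```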
The most noteworthy feature of this result is that the LAMP kernel w plays no part in the formula. This is surprising because the kernel clearly affects the autocorrelation structure of the process. Naive intuition suggests that a process with stronger autocorrelations should present less information at each time step, and hence have a smaller entropy rate.
While at first surprising, this result is intuitive: once the conditioning state $X_{T_n}$ is selected, the process transitions in the same manner as a first-order Markov Chain. The randomness introduced by the choice of the conditioning state balances the increased predictability that correlation with the history of the process provides.
Alternatively, we might think of this as increasing the correlations in the process, while also increasing the variance per time step, maintaining a constant entropy rate. However, we work here in the context of a finite alphabet of (not necessarily ordinal) states, so the common notions of variance have been subtly generalised by the LAMP model.
This useful result provides a model able to capture complex dependency structures in data but maintain consistent randomness on a symbol by symbol level.

V. MODEL COMPARISONS
To demonstrate the significance of the LAMP model, we will consider the dependency structure of 5 comparable models: a first-order Markov model, and four alternative LAMP models with differing $w_k$ (Figure 1). Each of these models shares the same transition matrix, illustrated in Figure 1 (i). Since they share a transition matrix, these sequences all share the same stationary distribution and entropy rate of 0.45627 bits per symbol, so each model is superficially similar.
The dependency structure is measured by calculating Cramér's V statistic for various lags. Based on Pearson's Chi-Squared test, the Cramér's V statistic measures the association between two discrete, nominal variables. To measure the self association for a sequence with a given lag l, pairs of values from the sequence at index i and i + l are used to construct a contingency table, which is used to calculate Cramér's V statistic. The statistic ranges from 0 to 1, where 0 indicates no correlation, and 1 indicates a perfect relationship between the pairs of variables. Calculating this value for various values of l produces a measure of dependency structure comparable to a lagged cross-correlation plot, but without assigning ordinal value to the state labels.
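The lagged Cramér's V calculation described above can be sketched as follows. This is a minimal illustration using the plain (uncorrected) statistic; the function name is hypothetical, and whether a bias correction is applied in the figures is not stated here.

```python
import math
from collections import Counter


def cramers_v_at_lag(seq, lag):
    """Cramér's V between seq[i] and seq[i + lag], treating symbols as nominal."""
    pairs = list(zip(seq, seq[lag:]))
    total = len(pairs)
    joint = Counter(pairs)
    rows = Counter(a for a, _ in pairs)
    cols = Counter(b for _, b in pairs)
    # Pearson chi-squared statistic over the full contingency table,
    # including cells with zero observed count.
    chi2 = 0.0
    for a, row_count in rows.items():
        for b, col_count in cols.items():
            expected = row_count * col_count / total
            chi2 += (joint.get((a, b), 0) - expected) ** 2 / expected
    dof = min(len(rows), len(cols)) - 1
    return math.sqrt(chi2 / (total * dof)) if dof > 0 else 0.0
```

Evaluating this function for l = 1, ..., 40 on a sample path gives a dependency profile of the kind plotted in Figure 1.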
The autocorrelation structure of a standard first-order Markov model is shown in Figure 1 (ii), along with a short sample path. Figure 1 (iii)-(vi) show Cramér's V statistic for the LAMP models in comparison to the choice of w. Note that lag = 0 is not plotted in these cases, as this value will always be 1.
It is evident that the dependency structure mirrors the shape of $w_k$, demonstrating that the LAMP framework provides a useful model for generating sequences with a desired dependency structure. It is important to note, however, that this structure will repeat and diminish for integer multiples of the original lag, and will also include structure from the original Markov model, so the two are not identical. Nevertheless, there is an evident ability to tune the autocorrelation of a LAMP, while preserving other properties such as the stationary distribution and entropy rate.

VI. EMPIRICAL EVALUATION
Following the same approach as the original LAMP paper [KMST17], we fit and estimate Shannon entropy rates on four publicly available datasets, LASTFM, BRIGHTKITE, WIKISPEEDIA and REUTERS. Each dataset is comprised of a number of distinct sequences which represent the activity of an individual or a single process. We compare the entropy rate estimates of (1) an empirical estimator, (2) the estimator obtained by fitting a Markov chain, and (3) the entropy estimate obtained by fitting a LAMP.
Our code is available at github.com/bridget-smart/LAMPEntropyEstimates and the links to each dataset are provided in the text below.
The LASTFM dataset contains the user activity from the music streaming service last.fm for 992 users. Users can choose to listen to stations based on a genre or artist, a particular song, and can share their listening activity. The dataset contains the user, timestamp, artist and song, but here we construct sequences using only the user and artist. Each sequence contains data for a single user, with items representing different artists. Containing 19M items, the data is available at http://ocelma.net/MusicRecommendationDataset/lastfm-1K.html.
The BRIGHTKITE dataset contains check-in data from the location-based social networking service BrightKite, where users share their checked-in locations. Each sequence represents a single user, with items representing the locations which the user has checked into. 772,967 locations and 51,406 users are represented in the dataset. The data is available at https://snap.stanford.edu/data/loc-brightkite.html.
The WIKISPEEDIA navigation path dataset is comprised of sequences representing a path along Wikipedia links from a condensed version of Wikipedia. These sequences are collected using the online game WIKISPEEDIA, where users race to navigate between articles. This dataset contains 51,318 complete paths. The dataset is available at https://snap.stanford.edu/data/wikispeedia.html.
For the REUTERS dataset we use the Reuters-21578, Distribution 1.0 benchmark corpus, containing newswire articles, with each sequence representing a single article and items representing distinct words. This dataset is available at www.nltk.org/nltkdata.
Following the same process as in [KMST17], consecutive repeated values within each path are removed to prevent self-loops from appearing in the transition matrices. Consecutive repeated values are also not meaningful in our datasets, as they do not contain temporal data. We also replace values which appear fewer than 10 times (or fewer than 50 times for the LASTFM dataset) with a unique token, representing a less frequently visited state. This preprocessing ensured our model fit was consistent with the original paper, and across both the LAMP and first-order Markov models.
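The two preprocessing steps can be sketched as below. This is a minimal illustration, not the code used for the reported results: the function name and the `<RARE>` token are hypothetical, and the order of the steps (collapsing repeats before counting frequencies) is our own assumption.

```python
from collections import Counter
from itertools import groupby


def preprocess(sequences, min_count=10, rare_token="<RARE>"):
    """Collapse consecutive repeats, then map infrequent items to one token."""
    # Drop consecutive duplicates within each sequence (removes self-loops).
    collapsed = [[item for item, _ in groupby(seq)] for seq in sequences]
    # Count occurrences across all sequences after collapsing (assumed order).
    counts = Counter(item for seq in collapsed for item in seq)
    return [
        [item if counts[item] >= min_count else rare_token for item in seq]
        for seq in collapsed
    ]
```

Note that replacing rare items can place two copies of the token next to each other, so a second collapsing pass may be wanted depending on how self-loops on the rare state should be treated.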
LAMP models are fit to each dataset using the original LAMP code available at github.com/google-research/google-research/tree/master/lamp. Modified code used to obtain the LAMP transition matrices for this paper is available at github.com/bridget-smart/modifiedlamp. The code used to obtain the entropy estimates is available at github.com/bridget-smart/LAMPEntropyEstimates.
While a modified version of the authors' original code was used to fit the LAMP model, care was taken to exactly reproduce the preprocessing steps outlined in Kumar et al. [KMST17] when fitting the remaining models. The code from Kumar et al. [KMST17] fits the LAMP model using a subset of the data and a cross-validation technique to overcome computational and time constraints. Therefore, the estimate using the LAMP model was fit on a smaller subset of the data than other estimates. Discrepancies in the transition matrix size between the first-order Markov and LAMP models indicate the differences caused by this.
2) The Shannon entropy of the stationary distribution is also a fair over-estimate, indicating that the transition structure is important in these sequences.
3) The Shannon path-level estimator produces some overestimates, but also a very low value for the BrightKite data. This anomalous result may arise here for several reasons, but it is noteworthy that the Shannon estimator can be biased on short data, and as it is being applied here to each subsequence and averaged, we should expect bias in the overall estimate. At best it appears to be a sensitive and hard to interpret measure.
4) There is close agreement for entropy estimators using the largest connected component, and induced irreducibility. This similarity is most pronounced for the LAMP models. That suggests that the manner in which we corrected for reducible processes does not matter, but that the LAMP models are also more robust to details such as this.
5) Lower estimated entropies are often indicative of a better model (for the purpose of estimating entropy) that captures more of the predictability of the sequence. The estimates generated using the LAMP models are the lowest estimates for the WIKISPEEDIA and REUTERS datasets. The lower entropy estimate values imply that more information about the sequence is captured by the LAMP model than the underlying first-order Markov model. This implies that the model captures more information about the behaviour of the system, so the generated sequence is more predictable. By better capturing the long-term dependency structures in the data, we achieve a more meaningful model. The estimates obtained using the LAMP model appear appropriate for all datasets. This supports the claims of [KMST17], that the LAMP model provides a better, succinct, representation of these datasets, but our claim is slightly stronger in the context of entropy because they compare against k-order and adaptive Markov chains.
6) For the LASTFM and BRIGHTKITE datasets, the estimates obtained using the first-order Markov models are lower than those obtained with the LAMP model. These two datasets represent artist sequences on a music streaming service and checked-in user locations, and it is possible that these sequences can be accurately captured without the LAMP structure. This is somewhat in contrast with [KMST17], but note that this finding is only in respect to the entropy estimates and there are other facets of a model that are also important.

VII. CONCLUSION
By deriving a theoretical result for the entropy rate of a Linear Additive Markov Process, we extend the versatility of LAMPs to information-centric applications and derive a surprising equality between the entropy rate of a LAMP and the underlying first-order Markov Chain. A consequence of this is that realisations of processes having different long-range dependency structures, and sample paths which are therefore quite different from each other, might nonetheless be indistinguishable based on entropy rates. This is because the entropy rates depend only on the underlying transition matrix structure, and not the long-range dependence structure. The LAMP however provides a means to distinguish between such processes, by visualisation of the long-range dependency structure. We demonstrate this via numerical simulations and use of the Cramér's V statistic. While LAMPs exhibit long-term dependency structures, their symbol-by-symbol dependency is low due to the randomness introduced by the distribution over the choice of conditioning state. Linear Additive Markov Processes present an intuitive yet useful extension of traditional Markov chain models to incorporate long-range dependence. This work demonstrates that LAMPs can be understood within a traditional information-theoretic framework, and many further results are possible in this context. Future work will explore the application of LAMPs to understanding long-range phenomena such as in social media communications [MMNB17], to provide enhancements to network traffic measurements such as [NR13], and to derive online estimators for entropy such as in [RVA98].

APPENDIX A ON CONNECTING A MARKOV CHAIN: CHOOSING ARTIFICIAL STATES AND CONNECTION PROBABILITIES
For a unique stationary distribution of a Markov Chain to exist, it is necessary for the Markov Chain to be ergodic [Nor98]. By adding an artificial state to the chain, which is connected to all other states with some transition probability $p_{\text{artificial}}$, we can guarantee the chain is ergodic, and that a unique stationary distribution exists. The choice of hyperparameter $p_{\text{artificial}}$ is critical to ensuring this artificial state does not dramatically alter the behaviour of the Markov chain and instead acts as a link between otherwise disconnected communicating classes.
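One way to realise this construction is sketched below. The details are our own assumptions for illustration, since the exact construction is not spelled out in the text: each original row is rescaled by $1 - p_{\text{artificial}}$ so that it remains stochastic, and the artificial state is assumed to move uniformly to the original states. The function name is hypothetical.

```python
def add_artificial_state(P, p_artificial=2 ** -15):
    """Return an (n+1) x (n+1) row-stochastic matrix with one extra linking state.

    Assumes P is a row-stochastic list of lists. Each original row is scaled by
    (1 - p_artificial) and sends p_artificial to the new state; the new state
    moves uniformly to the original states (an assumed choice).
    """
    n = len(P)
    out = [[(1 - p_artificial) * p for p in row] + [p_artificial] for row in P]
    out.append([1.0 / n] * n + [0.0])
    return out


# Two disconnected absorbing states become linked through the artificial state.
linked = add_artificial_state([[1.0, 0.0], [0.0, 1.0]], p_artificial=0.25)
```

Under these assumptions every state can reach every other state through the artificial state, which links otherwise disconnected communicating classes.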
The parameter $p_{\text{artificial}}$ was chosen for each of the four datasets for both the LAMP and first-order Markov Chain approaches by calculating entropy estimates for $p_{\text{artificial}} = 2^{-i}$, with $i = 1, \ldots, 25$ for the first-order Markov estimates and $i = 1, \ldots, 50$ for the LAMP estimates. Since we are selecting for small values of $p_{\text{artificial}}$, it is also important to ensure we avoid numerical precision errors. It is important that the estimated value is robust to small changes in $p_{\text{artificial}}$. By selecting values of $p_{\text{artificial}}$ for which the estimate is stable and appears to have converged, we can estimate the entropy of the process. The true entropy value will differ between models and datasets, so we are not concerned with the value which the estimate converges to, only the stability of the estimate.
Generally, these estimates approach a stable value for larger values of i, although it is not known if this trend would continue, due to precision limitations. The largest value for which the estimate is stable was used to obtain each of the estimates.
The results of these simulations are shown in Figure 2. For all dataset model combinations except for the entropy estimates obtained using the first-order Markov model for the WIKISPEEDIA and the REUTERS datasets, the entropy values approach a limiting value from above, with most estimates converging to a stable value. The only estimate which does not appear to converge is the estimate obtained with the first-order model for the BRIGHTKITE dataset, which begins to flatten between −7 and −10 (in $\log_2 p_{\text{artificial}}$), but then approaches zero. This may be due to numerical limitations, but this behaviour will be explored in future work.
In Figure 2, the entropy estimates are normalised to have a minimum value of 0 and a maximum value of 1, to enable the shape of the curves to be easily compared. This highlights the shape of the curve and emphasises the direction of convergence. These normalised convergence curves for the WIKISPEEDIA and REUTERS datasets overlap for the first-order Markov model, so the values for the REUTERS dataset are offset by +0.04 for visualisation. Similarly, for the LAMP model convergence curves, a small offset is added to the REUTERS and BRIGHTKITE datasets (+0.01 and −0.005 respectively).
The choice of $p_{\text{artificial}} = 2^{-15}$ appears appropriate for all datasets where the estimates appear to converge, while remaining large enough to avoid precision errors. This artificial transition probability is equivalent to observing a single transition amongst 32 768, which is sufficiently small to have a minimal impact on the scales we are considering. It is important that this choice is evaluated for different datasets, where the size of the data may make a different value more suitable.

Figure 1. (i) A diagram of the underlying transition matrix for each of the models shown in (ii)-(vi). While these models all produce sample paths with differing dependency structures, they all have the same entropy rate, as proven in Theorem IV.1. In blue, the dependency structure of each model is visualised using Cramér's V for lags up to 40. This provides a measure of correlation for symbolic variables and is analogous to considering a lagged cross-correlation function. In (ii), the dependency structure is shown for a first-order Markov model. In (iii)-(vi) we see that the frequency of $w_k$, or the distribution of the size of the backward step, directly influences the shape of the dependency structure. While the Cramér's V is calculated for a lag of 0 for the first-order Markov Model, it is excluded from plots (iii)-(vi) since the value is much larger due to the Markov Chain's memoryless property. It was excluded to allow the similarity between the frequency distribution of step size and dependency structure to be more easily seen.

Figure 2. Plots to show convergence for the normalised entropy estimate value against $\log_2 p_{\text{artificial}}$, which was used to ensure ergodicity. (i) Shows hyperparameter sensitivity for the first-order Markov models and (ii) for the LAMP models. Each plot shows the effect of the weight of this artificial link on both the first-order Markov model estimate and the estimate obtained using the LAMP model for each dataset. We aim to find a region where the estimates are insensitive to the choice of hyperparameter. A dashed black line on each plot indicates the value when the artificial link weight is $2^{-15}$. This value was chosen as a global value, since it is a reasonable choice for all dataset model combinations, apart from the BrightKite dataset first-order Markov model, when a value of $2^{-10}$ was used to obtain the final estimate. This alternative value is indicated by a grey dashed line. The convergence curves for the WIKISPEEDIA and REUTERS datasets overlapped for the first-order Markov model, so the values for the REUTERS dataset were offset by +0.04 for visualisation. A small vertical offset was also added to the REUTERS and BRIGHTKITE datasets in the LAMP model visualisation (+0.01 and −0.005 respectively).
ENTROPY ESTIMATES FOR EACH OF THE FOUR DATASETS USING SEVEN ESTIMATION METHODS. TWO MAIN APPROACHES ARE USED TO ENSURE TRANSITION MATRICES ARE ERGODIC: ONLY CONSIDERING THE LARGEST CONNECTED COMPONENT; AND ADDING AN ARTIFICIAL STATE WITH TRANSITION WEIGHT $p_{\text{artificial}}$. BOTH OF THESE APPROACHES GIVE SIMILAR RESULTS FOR EACH OF THE FOUR DATASETS.