Detecting Memory and Structure in Human Navigation Patterns Using Markov Chain Models of Varying Order

doi:10.1371/journal.pone.0102070

Figure 1.

Example of a navigation sequence in the WikiGame dataset.

Bottom row of nodes: A user navigates a series of Wikipedia articles, which can be represented as a sequence of Web pages. Top row of nodes: Each Wikipedia article can be mapped to a corresponding topic through Wikipedia's system of categories. This results in a sequence of topics.

Figure 2.

Log-likelihoods for random path dataset.

Raw log-likelihoods alone would always favor higher Markov chain orders, because the log-likelihood increases with the order. Log-likelihoods by themselves are therefore not sufficient for finding the appropriate Markov chain order; methods are needed that balance the goodness of fit against the number of model parameters.
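
The effect described in this caption can be reproduced with a minimal sketch. It assumes maximum likelihood estimation of the transition probabilities and the textbook forms of AIC and BIC, which may differ from the exact formulation used in the article. The function names, the uniformly random toy paths and the choice to score only positions from max_order onward (so that every order is evaluated on the same observations) are illustrative assumptions, not part of the original work.

```python
import math
import random
from collections import Counter, defaultdict


def log_likelihood(paths, k, skip):
    """Maximised log-likelihood of an order-k Markov chain.

    Only positions >= skip are scored, so all orders see the same observations.
    """
    counts = defaultdict(Counter)
    for path in paths:
        for i in range(skip, len(path)):
            context = tuple(path[i - k:i])  # previous k states; () when k == 0
            counts[context][path[i]] += 1
    ll = 0.0
    for successors in counts.values():
        total = sum(successors.values())
        for c in successors.values():
            ll += c * math.log(c / total)  # the MLE probability is c / total
    return ll


def compare_orders(paths, max_order):
    """Return (order, log-likelihood, AIC, BIC) for orders 0..max_order."""
    n_states = len({s for p in paths for s in p})
    n_obs = sum(max(len(p) - max_order, 0) for p in paths)
    rows = []
    for k in range(max_order + 1):
        ll = log_likelihood(paths, k, skip=max_order)
        n_params = n_states ** k * (n_states - 1)  # free parameters of order k
        aic = 2 * n_params - 2 * ll                # textbook AIC (assumption)
        bic = n_params * math.log(n_obs) - 2 * ll  # textbook BIC (assumption)
        rows.append((k, ll, aic, bic))
    return rows


def random_paths(n_paths, length, n_states, seed=0):
    """Toy stand-in for a random path dataset: uniform, memoryless clicks."""
    rng = random.Random(seed)
    return [[rng.randrange(n_states) for _ in range(length)]
            for _ in range(n_paths)]


if __name__ == "__main__":
    for k, ll, aic, bic in compare_orders(random_paths(200, 20, 5), 3):
        print(f"order {k}: logL={ll:.1f}  AIC={aic:.1f}  BIC={bic:.1f}")
```

Because the models are nested and fitted on identical observations, the log-likelihood column cannot decrease with the order, even though the toy paths have no memory at all. The penalised criteria are what allow a lower order to win.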

Table 1.

Dataset statistics.

Figure 3.

Topic frequencies.

Frequency of categories (in percent) over all paths in (A) the Wikigame topic dataset, (B) the Wikispeedia dataset and (C) the MSNBC dataset. The colors mark the categories we investigate in detail later and are specific to each dataset; the same color in different datasets does not represent the same topic. The Wikigame topic dataset contains more distinct categories than the Wikispeedia and MSNBC datasets. The most frequently occurring topic in the Wikigame topic dataset is Culture, with around 13%. The Wikispeedia dataset is dominated by the two categories Science and Geography, each accounting for almost 25% of all clicks. Finally, the most frequent topic in the MSNBC dataset is the frontpage, with a frequency of around 22%.

Figure 4.

Model selection results for the Wikigame page dataset.

The top row shows results obtained using likelihood and information-theoretic methods: (A) likelihoods, (B) likelihood ratio statistics (* statistically significant at the 1% level; ** statistically significant at the 0.1% level) as well as AIC (C) and BIC (D) statistics. The bottom row illustrates results obtained from Bayesian inference: (E) evidence and (F) Bayesian model selection. Finally, (G) presents the results from cross validation. The overall results suggest a zero order Markov chain model.

Figure 5.

Model selection results for the Wikigame topic dataset.

The top row shows results obtained using likelihood and information-theoretic methods: (A) likelihoods, (B) likelihood ratio statistics (* statistically significant at the 1% level; ** statistically significant at the 0.1% level) as well as AIC (C) and BIC (D) statistics. The bottom row illustrates results obtained from Bayesian inference: (E) evidence and (F) Bayesian model selection. Finally, (G) presents the results from cross validation. The overall results suggest that higher order chains are more appropriate for our navigation paths consisting of topics. In detail, we find that a second order Markov chain model best explains the data in our Wikigame topic dataset.

Figure 6.

Model selection results for the Wikispeedia dataset.

The top row shows results obtained using likelihood and information-theoretic methods: (A) likelihoods, (B) likelihood ratio statistics (* statistically significant at the 1% level; ** statistically significant at the 0.1% level) as well as AIC (C) and BIC (D) statistics. The bottom row illustrates results obtained from Bayesian inference: (E) evidence and (F) Bayesian model selection. Finally, (G) presents the results from cross validation. The overall results suggest that higher order chains are more appropriate for our navigation paths consisting of topics. Concretely, we find that a second order Markov chain model best explains the data in our Wikispeedia topic dataset.

Figure 7.

Model selection results for the MSNBC dataset.

The top row shows results obtained using likelihood and information-theoretic methods: (A) likelihoods, (B) likelihood ratio statistics (* statistically significant at the 1% level; ** statistically significant at the 0.1% level) as well as AIC (C) and BIC (D) statistics. The bottom row illustrates results obtained from Bayesian inference: (E) evidence and (F) Bayesian model selection. Finally, (G) presents the results from cross validation. The overall results suggest that higher order chains are more appropriate for our navigation paths consisting of topics. Specifically, the results suggest a third order Markov chain model.

Figure 8.

Global structure of human navigation.

Common transition patterns of navigational behavior on all three topic datasets (Wikigame, Wikispeedia and MSNBC). Patterns are illustrated by heatmaps calculated on the first order transition matrices. Each cell is normalized by the total number of transitions in the dataset. The vertical lines depict starting states and the horizontal lines depict target states. A main observation is that self transitions (e.g., a transition from Culture to Culture) dominate all datasets. However, the goal-oriented datasets (Wikigame and Wikispeedia) exhibit more transitions between distinct categories than the free navigation dataset (MSNBC).
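
The normalization described in this caption can be sketched as follows. The function names and the toy paths are hypothetical, and the layout (rows as starting states, columns as target states) is an assumption about how the heatmaps are arranged.

```python
from collections import Counter


def normalized_transition_matrix(paths):
    """First order transition counts, each divided by the total number of transitions."""
    counts = Counter()
    for path in paths:
        counts.update(zip(path, path[1:]))  # consecutive (source, target) pairs
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one transition")
    states = sorted({s for p in paths for s in p})
    # Rows are starting states, columns are target states (assumed layout).
    matrix = [[counts[(src, dst)] / total for dst in states] for src in states]
    return states, matrix


def self_transition_share(matrix):
    """Fraction of all transitions that stay in the same state (the diagonal)."""
    return sum(matrix[i][i] for i in range(len(matrix)))


if __name__ == "__main__":
    toy_paths = [["Culture", "Culture", "Science"],
                 ["Science", "Culture", "Culture"]]
    states, matrix = normalized_transition_matrix(toy_paths)
    print(states)
    print(matrix)
    print(self_transition_share(matrix))
```

Since every cell is divided by the same total, all cells of one matrix sum to 1, and the diagonal sum gives the share of self transitions that the caption highlights.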

Figure 9.

Local structure of navigation for the Wikigame topic dataset.

The graphs above illustrate selected state transitions from the Wikigame topic dataset for different orders k. The nodes represent categories and the links represent transitions between categories. The link weight corresponds to the transition probability from the source to the target node, determined by MLE. The node size corresponds to the sum of the incoming transition probabilities from all other nodes to that node. The left figure shows the top four categories with the highest incoming transition probabilities for an order of k = 1. For those nodes we draw the four highest outgoing transition probabilities to other nodes. The middle figure visualizes the Markov chain of order k = 2 by setting the top topic (Culture) as the first click; this diagram shows transition probabilities from the top four categories given that users first visited the Culture topic. For example, the links from the red node (Society) in the bottom-right part of the diagram represent the transition probabilities from the sequence (Culture, Society). Similarly, the right figure visualizes order k = 3 by selecting the node with the highest incoming probability (Culture, Culture) of order k = 2. We then show transition probabilities from other nodes given that users already visited (Culture, Culture). For example, the links from the brown node (Politics) at the top represent the transition probabilities from the sequence (Culture, Culture, Politics).

Figure 10.

Self transition structure of navigation for the Wikigame topic dataset.

The number of times users stay within the same topic vs. the number of times they change the topic during navigation, for different orders k in our Wikigame topic dataset. Only the top three categories with the highest transition probabilities are shown. With high consistency, the transition probabilities to the same topic increase while those to other categories decrease with ascending order k.
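
One way to read this comparison is as the probability that a user stays in a topic after having already spent the last k clicks in it. The sketch below computes that quantity under this reading; the reading itself, the function name and the toy paths are assumptions made for illustration and are not taken from the article.

```python
def self_transition_probability(paths, state, k):
    """Estimate P(next == state | the previous k clicks were all `state`).

    Returns NaN when the context (state repeated k times) never occurs.
    """
    stay = total = 0
    for path in paths:
        for i in range(k, len(path)):
            if all(s == state for s in path[i - k:i]):
                total += 1
                stay += path[i] == state  # True counts as 1
    return stay / total if total else float("nan")


if __name__ == "__main__":
    toy_paths = [["A", "A", "A", "B"], ["A", "B", "A", "A"]]
    for k in (1, 2):
        print(k, self_transition_probability(toy_paths, "A", k))
```

On real navigation paths, plotting this value against k for the most frequent topics would give curves comparable in spirit to the ones described here; the toy paths only demonstrate the counting.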

Figure 11.

Local structure of navigation for the MSNBC dataset.

The graphs above illustrate selected state transitions from the MSNBC dataset for different orders k. The nodes represent categories and the links represent transitions between categories. The link weight corresponds to the transition probability from the source to the target node, determined by MLE. The node size represents the global importance of a node in the whole dataset and corresponds to the sum of the outgoing transition probabilities from that node to all other nodes. For visualization reasons we primarily focus on the top four categories with the highest sum of outgoing transition probabilities (i.e., those with the largest node sizes) for an order of k = 1. For those nodes we draw the four highest outgoing transition probabilities to other nodes. The middle figure visualizes the Markov chain of order k = 2 by setting the top topic (frontpage) from order k = 1 as the first click; this diagram shows transition probabilities from the top four categories given that users first visited the frontpage topic (represented by the dashed transitions in the left figure for k = 1). For example, the links from the blue node (news) in the top-left corner of the diagram represent the transition probabilities from the sequence (frontpage, news) to other nodes. Similarly, the right figure visualizes order k = 3 by selecting the node with the highest sum of outgoing transition probabilities (frontpage, frontpage) and its four highest outgoing transition probabilities from order k = 2 (represented by the dashed transitions in the middle figure for k = 2). We then show transition probabilities from other nodes given that users already visited (frontpage, frontpage). For example, the links from the red node (sports) at the top represent the transition probabilities from the sequence (frontpage, frontpage, sports) to other nodes.

Figure 12.

Self transition structure of navigation for the MSNBC dataset.

The number of times users stay within the same topic vs. the number of times they change the topic during navigation, for different values of the order k. Only the top three categories with the highest transition probabilities are shown. With high consistency, the transition probabilities to the same topic increase while those to other categories decrease with ascending order k.

Figure 13.

Common global transition patterns of navigational behavior on the Wikigame topic dataset.

The results should be compared with Figure 8. The results are split into a corpus of paths in which each path ends with the same topic it starts with (A) and a corpus of paths with distinct start and target categories (B).
