Search strategies of Wikipedia readers

The quest for information is one of the most common activities of human beings. Despite the impressive progress of search engines, finding the needed piece of information can still be very hard, as can acquiring specific competences and knowledge by shaping and following suitable learning paths. Indeed, the need to find sensible paths in information networks is one of the biggest challenges of our societies and, to address it effectively, it is important to investigate the strategies human users adopt to cope with the cognitive bottleneck of finding their way in a growing sea of information. Here we focus on the case of Wikipedia and investigate a recently released dataset of users’ clicks on the English Wikipedia, namely the English Wikipedia Clickstream. We perform a semantically charged analysis to uncover the general patterns followed by information seekers in the multi-dimensional space of Wikipedia topics/categories. We discover the existence of well-defined strategies in which users tend to start from very general, i.e., semantically broad, pages and progressively narrow down the scope of their navigation, while maintaining a growing semantic coherence. This is unlike strategies associated with tasks having predefined search goals, namely the case of the Wikispeedia game, in which users first move from the ‘particular’ to the ‘universal’ before focusing down again on the required target. The clear picture offered here represents an important stepping stone towards a better design of information networks and recommendation strategies, as well as the construction of radically new learning paths.


Walk length distribution
In Fig. A, we report the distribution of the lengths of the simulated paths originating from the 10 available sources classified in the dataset and described in the main text. From each source, we generated $10^7$ paths.
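The simulated paths can be thought of as weighted random walks on the clickstream graph: at each step the next page is drawn with probability proportional to the recorded click counts, until a page with no outgoing clicks is reached. A minimal sketch of this sampling procedure, where the transition table and page names are hypothetical and not the paper's actual pipeline:

```python
import random
from collections import Counter

def simulate_paths(transitions, source, n_paths, max_len=50):
    """Sample paths from a clickstream transition table and record
    their lengths. `transitions` maps a page to a list of
    (next_page, click_count) pairs; a walk ends at a page with no
    recorded outgoing clicks."""
    lengths = Counter()
    for _ in range(n_paths):
        page, length = source, 0
        while transitions.get(page) and length < max_len:
            pages, counts = zip(*transitions[page])
            # next page drawn with probability proportional to clicks
            page = random.choices(pages, weights=counts)[0]
            length += 1
        lengths[length] += 1
    return lengths

# toy transition table with hypothetical click counts
transitions = {
    "google": [("Physics", 80), ("History", 20)],
    "Physics": [("Quantum_mechanics", 50)],
}
dist = simulate_paths(transitions, "google", n_paths=100)
```

Collecting the resulting `lengths` counter over many walks gives exactly the kind of length distribution shown in Fig. A.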

Semantic vector representation: robustness results
The data and results reported in the main text are based on a dump of the English Wikipedia dated 10-22-2015. In this section, we report the same analysis on the paths originating from the source google, but referring to a different dump and a slightly modified procedure for the vector extraction.

Paths generated from the external source google: averages over a 38-topic semantic space. The $10^7$ paths simulated with google as source were split by length. For each fixed length $l$, we computed the averages of the following quantities over all the nodes (pairs) at $k$ steps (jumps) from the end: (A) the average norm $w_k^l$, (B) the entropy $S(w_k^l)$, (C) the distance and (E) the similarity between all pairs of consecutively visited nodes along each path, respectively $d(w_k^l, w_{k-1}^l)$ and $\mathrm{sim}(w_k^l, w_{k-1}^l)$, (D) the distance and (F) the similarity between each visited node and the ending node of the path, i.e. $d(w_k^l, w_0^l)$ and $\mathrm{sim}(w_k^l, w_0^l)$. The error bars display the standard errors of the means. Each color refers to a path length, from 3 (blue) to 9 (light green).
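The observables listed in the caption above can be computed directly from the topic vectors. A minimal sketch, assuming the entropy is the Shannon entropy of the normalized vector components, the similarity is the cosine similarity, and the distance is Euclidean (common conventions; the paper's exact definitions may differ):

```python
import numpy as np

def entropy(w):
    """Shannon entropy of a topic vector, treating its normalized
    components as a probability distribution (assumed convention)."""
    p = np.asarray(w, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def similarity(u, v):
    """Cosine similarity between two topic vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def distance(u, v):
    """Euclidean distance between two topic vectors (assumed metric)."""
    return float(np.linalg.norm(np.asarray(u) - np.asarray(v)))

broad  = np.array([0.5, 0.3, 0.2])  # page spread over several topics
narrow = np.array([0.0, 1.0, 0.0])  # page concentrated on one topic
assert entropy(broad) > entropy(narrow)  # broader page -> higher entropy
```

With these definitions, a semantically broad page yields a high-entropy vector, which is why the entropy trend tracks how general or specific the visited pages are.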
The dump considered here is dated 04-03-2015. In it, the Main Topic Classification had 38 subcategories, namely: agriculture, architecture, arts, chronology, creativity, culture, education, employment, energy, environment, geography, goods, government, health, history, humanities, humans, industry, information, knowledge, language, law, mathematics, medicine, mind, nature, objects, people, politics, science, sports, structure, systems, technology, telecommunications, universe, world. For each page, in extracting its vector representation based on the 38 coordinates listed above, we performed the first three phases as explained in the main text (Fig. 2 A-C): we selected the categories to which each page belongs and, for each category, we identified its most representative topic(s), i.e. the one(s), among the 38, from which the category depth is minimal. This depth measures the semantic representativeness of the topic. In the original procedure, only the smallest depth over the categories was considered for each topic when deriving the final vector, and its inverse was chosen as the corresponding weight. Here, instead of considering only the smallest contribution, for each topic we average the depths over all the categories for which that topic is the most representative one. The inverse of this average is the new weight of the topic in the final vector.
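The difference between the two weighting schemes can be made concrete with a short sketch. For each page we collect, for every (category, most-representative-topic) assignment, the corresponding depth; the original procedure keeps the inverse of the minimum depth per topic, while the modified one keeps the inverse of the mean. The input format and values below are illustrative:

```python
from collections import defaultdict

def topic_vector(assignments, use_mean=True):
    """Build a page's topic weights from (topic, depth) pairs, one
    per category for which that topic is most representative.
    use_mean=False reproduces the original min-depth weighting;
    use_mean=True the modified mean-depth weighting."""
    depths = defaultdict(list)
    for topic, depth in assignments:
        depths[topic].append(depth)
    if use_mean:
        return {t: len(d) / sum(d) for t, d in depths.items()}  # 1 / mean depth
    return {t: 1.0 / min(d) for t, d in depths.items()}         # 1 / min depth

# hypothetical page: two categories point to 'science', one to 'arts'
assignments = [("science", 2), ("science", 4), ("arts", 3)]
v_min  = topic_vector(assignments, use_mean=False)  # science -> 1/2
v_mean = topic_vector(assignments, use_mean=True)   # science -> 1/3
```

As the toy example shows, topics reached only through deep categories are penalized more under the mean-depth weighting, while topics with a single category contribution are unaffected.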
With this choice of vector representation, we replicate the analysis of the paths originating from google. The results of the averages over each position along paths of fixed length, and of the same averages rescaled via the aggregated means, are reported in Fig. B.

Averages over unrescaled paths
In the following figure, the global averages of the different observables shown in Fig. 4 of the main text are reported for the three datasets (google, the null model, and Wikispeedia). The averages are: (A) the average norm $\langle w_k^l \rangle_k$, (B) the entropy $\langle S(w_k^l) \rangle_k$, (C) the distance and (E) similarity between consecutive nodes, respectively $\langle d(w_k^l, w_{k-1}^l) \rangle_k$ and $\langle \mathrm{sim}(w_k^l, w_{k-1}^l) \rangle_k$, and finally (D) the average distance and (F) similarity to the last node of each path, respectively $\langle d(w_k^l, w_0^l) \rangle_k$ and $\langle \mathrm{sim}(w_k^l, w_0^l) \rangle_k$. As expected, no patterns emerge in the results for the null model. In contrast, the averages over google paths seem to follow a trend: e.g., the average norm and the distance between consecutive nodes decrease over paths of increasing length.
These averages are used in the main text to rescale the trends of the observables along the paths, thus obtaining the results shown in Fig. 5.
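The rescaling step amounts to dividing each observable's trend along paths of fixed length by its aggregated mean over all positions, so that trends for different path lengths become directly comparable. A minimal sketch with toy values:

```python
import numpy as np

def rescale_by_aggregated_mean(trends):
    """For each path length l, divide the observable's values along
    the path by their mean over all positions k (the aggregated mean)."""
    return {l: np.asarray(vals, dtype=float) / np.mean(vals)
            for l, vals in trends.items()}

# toy average norms along paths of length 3 and 5 (illustrative numbers)
trends = {3: [1.2, 0.9, 0.6], 5: [2.0, 1.6, 1.2, 1.0, 0.8]}
rescaled = rescale_by_aggregated_mean(trends)
# each rescaled trend has mean 1 by construction
```

After this normalization, the curves for different lengths can be overlaid on a single plot, which is what makes the collapse shown in Fig. 5 visible.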

Paths rescaled: miscellaneous sources
In this section, we report the averages of the usual measures over paths simulated with different sources. As in the analysis reported in the main text, the semantic space we consider is again the one referring to the dump dated 10-22-2015, i.e. with 13 topic-coordinates.
The sources discussed here are: twitter, bing, yahoo, facebook, wikipedia (any Wikipedia page that is not an article), internal (any page belonging to a different internal Wikimedia project), an empty referer, and other (any different referer). The Wikipedia incoming traffic fluxes from each of them are illustrated in Fig. 1 of the main text.
We note that the trends of these quantities are quite similar to the ones emerging when google is the source (Fig. 4 of the main text). Still, some differences emerge, e.g. in the entropy trend. Moreover, in this last case the norm behaves differently too: the large semantic jump typically seen at the first step of the walk is missing here, since the reader browses the first Wikipedia article directly, going straight to her content of interest (Fig. S12(A)).

E. Paths generated from the external source bing: averages. The $10^7$ paths simulated with bing as source were split by length. For each fixed length $l$, we computed the averages of the following quantities over all the nodes (pairs) at $k$ steps (jumps) from the end: (A) the average norm $w_k^l$, (B) the entropy $S(w_k^l)$, (C) the distance and (E) the similarity between all pairs of consecutively visited nodes along each path, respectively $d(w_k^l, w_{k-1}^l)$ and $\mathrm{sim}(w_k^l, w_{k-1}^l)$, (D) the distance and (F) the similarity between each visited node and the ending node of the path, i.e. $d(w_k^l, w_0^l)$ and $\mathrm{sim}(w_k^l, w_0^l)$. The error bars display the standard errors of the means. Each color refers to a path length, from 3 (blue) to 9 (light green).