A Geometric Representation of Collective Attention Flows

With the rapid development of the Internet and the World Wide Web, "information overload" has become an overwhelming problem, and the collective attention of users now plays an increasingly important role. Knowing how collective attention distributes and flows among different websites is therefore the first step toward understanding the underlying dynamics of attention on the WWW. In this paper, we propose a method to embed a large number of websites into a high dimensional Euclidean space according to the novel concept of flow distance, which considers both the connection topology between sites and the collective click behaviors of users. With this geometric representation, we visualize the attention flow in one day of the Indiana University clickstream data set. It turns out that all the websites can be embedded into a 20 dimensional ball, in which close sites are always visited by users sequentially. The distributions of websites, attention flows, and dissipations can be divided into three spherical crowns (core, interim, and periphery). The 20% most popular sites (Google.com, Myspace.com, Facebook.com, etc.), attracting 75% of the attention flows with only 55% of the dissipations (users logging off), locate in the central layer with radius 4.1, while 60% of the sites, attracting only about 22% of the traffic with almost 38% of the dissipations, locate in the middle area with radius between 4.1 and 6.3. The remaining 20% of the sites are far from the central area. All the cumulative distributions of these variables can be well fitted by "S"-shaped curves, and the patterns are stable across different periods. Thus, the overall distribution and dynamics of collective attention on websites can be well exhibited by this geometric representation.


Introduction
Every second, 684,478 pieces of content are shared on Facebook, 204,166,667 emails are sent, 100,000 tweets are posted, 27,778 new posts are published on Tumblr, and 571 new websites are created; data keeps growing with no sign of stopping [1]. However, only about 3 billion users (as of 2014) consume this ever-accumulating information on the Internet [2]; we are drowning in the sea of information and data. As H. A. Simon pointed out, "a wealth of information creates a poverty of attention" [3]: attention will doubtlessly play a more and more important role in the near future because of its scarcity and the overload of information. Thus, in this paper we study how collective attention distributes and flows among websites by embedding them into a high dimensional Euclidean space according to flow distances. We further quantitatively investigate the distribution patterns of sites, attention flows, and dissipations, and find that the cumulative quantities can be well fitted by "S"-shaped curves. Accordingly, all websites can be grouped into three layers. Popular websites like Google.com, Myspace.com, etc. lie in the core, where a large fraction of the attention flows and a relatively small fraction of the dissipations are attracted. The interim contains a large number of websites with a small proportion of the attention flows and a relatively large fraction of the dissipations. The other small sites, with little traffic and few dissipations, locate in the periphery. All the observations are stable over time.

Data
The raw data that we employ is the clickstream data set of the Indiana University campus (http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset/), which records the surfing behaviors of users at Indiana University during the period from October 2006 to February 2008 [20,23]. Although this data set is a biased sample of the entire WWW [20,24-27], it contains 123,137 websites and 45,563,567 transitions per day on average, and in particular the top websites of that period are included, such as Myspace.com, Facebook.com, and Yahoo.com. Therefore, a clear overall picture can be obtained, and the results are representative.
We adopt the classification of websites from (http://sitereview.bluecoat.com/sitereview.jsp). The raw data contains many advertising websites, such as 2o7.net, Advertising.com, and Doubleclick.com; we identify them according to the classification data and remove them from the raw data.

Construction of open flow networks
In the raw data, the transitions between two websites are recorded. The basic format of a record is (time stamp, referrer, host, path), where the time stamp is the Unix time of the surfing behavior, referrer and host are domain names, and path is the location visited within the host website.
We construct an open flow network model [28,29] from all the records in one day. First, we parse all the records in the data set and extract their domain names. We maintain a dictionary storing all the distinct domain names (websites), and replace the domain name strings in the raw data (referrer and host) by the index of the website in the dictionary. Second, we ignore the time stamps and count the total number of transitions $f_{ij}$ for each pair of website indices i and j; this is the flow from i to j. Some referrer fields in the raw data are null strings, meaning that these transitions have no (or missing) referrers; we treat these records as flows from the source (the outside world), so the null string represents the source. Third, we balance the entire network by adding dissipation flows from every node to the sink such that the inflow balances the outflow for each node. The dissipations must be added manually because jumps to the off-line world are not recorded in the raw data.
Finally, we obtain an (N + 2) × (N + 2) flux matrix denoted as F, where N is the total number of websites, node 0 represents the source, and node N + 1 represents the sink. Therefore, the flow $f_{i,N+1}$ is the dissipation of site i, and $\sum_{j=1}^{N+1} f_{ij}$ is the total attention flow (traffic) of i. We also call this flux matrix the attention flow network.
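The construction above can be sketched as follows; the records and domain names are hypothetical toy examples, not the Indiana data:

```python
import numpy as np

# Hypothetical toy records (time stamp, referrer, host); an empty referrer
# marks a transition with no recorded referrer, i.e. a flow from the source.
records = [
    (1160438400, "", "google.com"),
    (1160438402, "", "google.com"),
    (1160438405, "google.com", "myspace.com"),
    (1160438407, "google.com", "facebook.com"),
    (1160438420, "", "facebook.com"),
    (1160438431, "facebook.com", "myspace.com"),
]

# First: index the distinct domain names; 0 is the source, N + 1 the sink.
sites = sorted({d for _, ref, host in records for d in (ref, host) if d})
idx = {d: i + 1 for i, d in enumerate(sites)}
N = len(sites)

# Second: ignore time stamps and count the transitions f_ij.
F = np.zeros((N + 2, N + 2))
for _, ref, host in records:
    F[idx[ref] if ref else 0, idx[host]] += 1

# Third: balance the network by routing each node's surplus inflow to the
# sink; these dissipation flows stand for the unrecorded jumps off-line.
for i in range(1, N + 1):
    F[i, N + 1] = F[:, i].sum() - F[i, 1:N + 1].sum()
```

After the balancing step, every website node's total inflow equals its total outflow, with the sink column absorbing the surplus.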
The unique advantage of the open flow network, and its distinction from conventional topological networks and closed flow networks, is the consideration of the flows between sites, especially the flows from the source and to the sink.

Flow distances
Many scholars have studied distances on networks, such as the shortest path distance [30] and the mean first-passage distance derived from the random walk model [31-34]. However, due to the existence of the source and the sink, the conventional methods for computing random-walk distances cannot be directly applied to the open flow network, so we must develop a new method to calculate the flow distance. First, we calculate a Markov transition matrix M from the original flow matrix as $m_{ij} = f_{ij} / \sum_{k=1}^{N+1} f_{ik}$, where $m_{ij}$ represents the probability of a user jumping from i to j.
The flow distance between two websites is defined as the average number of steps for a visitor jumping from i to j for the first time along all possible flow pathways [21]. The closer two websites are, the easier it is for visitors to jump from one to the other. According to [21], the flow distance between two websites is $l_{ij} = (U^2)_{ij}/u_{ij} - (U^2)_{jj}/u_{jj}$, where $U = I + M + M^2 + \cdots = (I - M)^{-1}$, $u_{ij} = (U)_{ij}$ is the pseudo-probability of reaching j from i along all possible paths, and I is the identity matrix of order N + 2.
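To make the computation concrete, here is a minimal sketch on a hypothetical balanced flux matrix; the first-passage formula is our reading of ref. [21]:

```python
import numpy as np

# Hypothetical balanced flux matrix F for N = 2 sites;
# node 0 is the source, node 3 the sink (not the Indiana data).
F = np.array([
    [0, 4, 2, 0],   # source -> site 1, site 2
    [0, 0, 3, 2],   # site 1 -> site 2, sink
    [0, 1, 0, 4],   # site 2 -> site 1, sink
    [0, 0, 0, 0],   # the sink emits nothing
], dtype=float)

# Markov transition matrix: m_ij = f_ij / sum_k f_ik.
rowsum = F.sum(axis=1, keepdims=True)
M = np.divide(F, rowsum, out=np.zeros_like(F), where=rowsum > 0)

# U = I + M + M^2 + ... = (I - M)^{-1}: u_ij is the pseudo-probability
# of reaching j from i along all possible paths.
U = np.linalg.inv(np.eye(len(F)) - M)

# First-passage flow distance, as we read the formula of ref. [21]:
# l_ij = (U^2)_ij / u_ij - (U^2)_jj / u_jj.
U2 = U @ U
with np.errstate(divide="ignore", invalid="ignore"):
    L = U2 / U - (np.diag(U2) / np.diag(U))[None, :]

# Symmetric (commuting) flow distance between sites.
C = L + L.T
```

In this toy network every first passage between the two sites takes exactly one step (each site either jumps to the other or dissipates), so the formula yields a flow distance of 1 in both directions.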
Next, we apply our method to an example open flow network to illustrate the calculation of flow distances, as shown in Fig 1. In Table 1, we compare three kinds of distances on the example network. First, the shortest distances based on the binary link structure always underestimate the walk distances of real users, because they assume that every user can find the shortest path at the level of the entire network. Second, we compare the random walk distances, defined as the average number of steps along all possible flow paths for a random walker jumping along the links of the closed version of the same flow network. In the closed flow network, the source and the sink are excluded, and the jumping probability from i to j is simply the ratio between the flux $f_{ij}$ and the total outflow $\sum_{j=1}^{N} f_{ij}$; thus the dissipation of the nodes is not considered. In this way, the random walk distances on the closed flow network always overestimate the average path lengths, whereas real users rarely travel such long paths because they are likely to go offline at each jump. Thus, the flow distance depicts the average distance among websites more accurately by considering almost all the information of the network [21].
However, the flow distance matrix L is not symmetric, while the embedding into a Euclidean space requires symmetric distances. Thus we calculate the symmetric flow distance $c_{ij} = l_{ij} + l_{ji}$, which symmetrizes $l_{ij}$ and $l_{ji}$ and can also be interpreted as the average commuting distance [22], i.e., the average path length for a random walker going from i to j and finally returning to i.

The embedding of websites
According to the symmetric flow distances $c_{ij}$, we can embed the websites into a Euclidean space by imposing that the Euclidean distances equal the flow distances among websites as accurately as possible; each node in the Euclidean space is then a "geometric image" of a website. We adopt a reduced version of the BigBang algorithm [35] for the embedding, implemented in the following steps.

Step 1: Initialization phase: assign each website k a random position $V_k^d$, its coordinate vector in the d-dimensional space.
Step 2: Adjustment phase: update each node's position according to the spring algorithm [36] until the embedding errors $E^{(1)}_{ij}(V_i, V_j)$ for all pairs of websites are small enough.
Here $E^{(1)}_{ij}(V_i, V_j) = \|V_i^d - V_j^d\| - c_{ij}$, where $\|V_i^d - V_j^d\|$ is the norm of the vector $V_i^d - V_j^d$, i.e., the Euclidean distance between the pair of websites (i, j), and $c_{ij}$ is the symmetric flow distance between i and j. Thus $E^{(1)}_{ij}(V_i, V_j)$ denotes the difference between the Euclidean distance and the flow distance.
We use the spring algorithm to compute the positions. Suppose that any two websites are connected by a spring whose relaxed length equals the flow distance $c_{ij}$. If the distance between the two websites in the d-dimensional space is larger than the relaxed length of the spring, the spring exerts a pulling force; otherwise, there is a repulsive force between them. The magnitude of the force is proportional to the value of $E^{(1)}_{ij}(V_i, V_j)$. This step is repeated until the total embedding error (denoted by δ, the average embedding distortion introduced in Step 3) drops below a given threshold (set to 1.50 in this paper).
Step 3: Fine-tuning phase: fine-tune the positions according to the embedding distortions $E^{(2)}_{ij} = \max(\|V_i^d - V_j^d\|/c_{ij},\; c_{ij}/\|V_i^d - V_j^d\|)$, the larger ratio between the Euclidean distance and the flow distance. The effectiveness of the embedding is characterized by δ, the average of $E^{(2)}_{ij}$ over all pairs of websites. We repeatedly adjust the positions of the nodes with the spring algorithm until δ (the average embedding distortion between websites) is smaller than 1.14.
In practice, the adjustment phase makes large corrections quickly, while the fine-tuning phase modulates the website positions slightly once the difference between the Euclidean distance and the flow distance is small. Taking the ratio of the Euclidean distance to the flow distance as the indicator eliminates the effect of the absolute magnitude of the distances.
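The two phases above can be sketched as follows (hypothetical parameter choices; the actual BigBang implementation [35] is more elaborate):

```python
import numpy as np

rng = np.random.default_rng(0)

def spring_embed(C, d=20, iters=2000, eta=0.05):
    # Adjustment phase (sketch): every pair of nodes is tied by a spring of
    # relaxed length c_ij; the force is proportional to
    # E1 = ||V_i - V_j|| - c_ij (pull when positive, push when negative).
    n = len(C)
    V = rng.normal(size=(n, d))          # Step 1: random initial positions
    for _ in range(iters):
        for i in range(n):
            for j in range(i + 1, n):
                diff = V[i] - V[j]
                dist = np.linalg.norm(diff) + 1e-12
                err = dist - C[i, j]
                step = eta * err * diff / dist
                V[i] -= step
                V[j] += step
    return V

def distortion(V, C):
    # Average embedding distortion delta, based on the larger ratio
    # E2 = max(euclid / c_ij, c_ij / euclid) over all pairs (our reading).
    n, total, pairs = len(C), 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            d_ij = np.linalg.norm(V[i] - V[j])
            total += max(d_ij / C[i, j], C[i, j] / d_ij)
            pairs += 1
    return total / pairs
```

For a distance matrix that is exactly embeddable (e.g., pairwise distances of a few points in the plane), the loop drives the distortion toward its minimum value of 1.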

Results
The distribution of flow distances among websites

We find that the distributions are similar at different times, and the average distances in the four snapshots are all close to 4.5, exhibiting the small-world effect. The notion of flow distance considers both the topological closeness of websites and the average real surfing behaviors, which makes it apparently different from the traditional shortest path distance [30] and random walk distance [31-34] on closed flow networks (see the detailed discussion in the method section).

The geometric representation of websites
Next, we select a strongly connected sub-network containing 2200 websites from the top 4000 websites of the original network on October 10, 2006, and embed it into a 20 dimensional Euclidean space using a reduced version of the BigBang algorithm [35] (see the method section for details), such that the Euclidean distance between any two nodes is as close as possible to their flow distance. In this way, each node obtains a coordinate. To visualize this sub-network, we project all nodes into a two dimensional space using the PCA method [37,38] to reduce the dimensionality. The positions of the websites reveal their centrality in the whole network. The sites located in the central area are generally more important than other sites for the whole website ecosystem, because their small distances to all other sites imply that users are likely to visit them frequently wherever they come from or go to. Interestingly, although social network sites like Myspace.com and Facebook.com attract very large amounts of traffic, they are not the center of the whole system. Instead, Google.com is more central in the sense of attention flow than the social networks (see Table 2). This observation is consistent with the intuition that Google.com has become the portal of the whole web world, transporting users' attention into the virtual world. Table 2 lists the top 15 websites ranked by their average distance to other nodes. As a comparison, we also list the ranking produced by the PageRank algorithm [39] in Table 3. We find that PageRank tends to give high ranks to websites with many in-links (Indiana.edu, Amazon.com, etc.) rather than high attention flows (Myspace.com, Msn.com, Cnn.com, Aol.com, and so on).
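The PCA projection step can be sketched as follows (a plain SVD implementation; a library PCA routine gives the same result):

```python
import numpy as np

def pca_project(V, k=2):
    # Project the n x d embedded coordinates onto the top-k principal
    # components for visualization.
    X = V - V.mean(axis=0)                  # center the point cloud
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:k].T                     # n x k coordinates for plotting
```

By construction, the first projected coordinate captures at least as much variance as the second, so the 2D scatter preserves the dominant geometry of the 20-dimensional ball.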

The distributions of attention flows, attention dissipations and websites in the geometric representation
Because Google.com has the smallest average distance to the other websites, it is set as the center of the geometric representation, and all the other websites form a nearly symmetric ball around it. We therefore study the distributions of three variables along the distance from the center of the ball: attention flow (the traffic of each website), attention dissipation (the flow from each website to the sink), and the number of websites. Instead of drawing the density curves directly, we accumulate each quantity within a given radius to reduce the effect of noise in the data, because cumulative curves carry the same information as density curves. We discover that, as the radius increases, the cumulative amounts show sigmoid growth patterns (see Fig 5). These "S" curves reflect the heterogeneity of the distributions. From Fig 5, we know that most of the quantities concentrate in the central area within radius 6. We then separate the whole ball into three layers along the radius according to the quantiles of the number of websites. The first layer is the ball with radius 4.1, selected according to the 20% quantile of the websites; that is, the 20% most important (popular) websites are included in this layer. It attracts almost three quarters of the attention flows in the whole network with only a relatively small fraction of the dissipations, so we regard this layer as the core. The second layer, with radius between 4.1 and 6.2 (the 80% quantile of the websites), includes about 60% of the sites but contains only 22% of the attention flows at the cost of 38% of the dissipations; these websites are not attractive enough, and we call this layer the interim. The remaining small websites locate in the last layer, the periphery, with radii larger than 6.2.
To quantitatively characterize the "S"-shaped curves along the radius for these three quantities, we use the Gompertz function [40] to fit the normalized cumulative curves of attention flows (traffics, T(R)), attention dissipations (D(R)), and the number of websites (N(R)) within the radius R. The fitting function can be expressed as $X(R) = \exp(-e^{-c_X (R - k_X)})$, where X can be T, D, or N, and $k_X$, $c_X$ are the corresponding parameters to be estimated: $c_X$ characterizes the slope of the fast-rising phase of the "S"-shaped curve, and $k_X$ indicates the offset of the whole curve along the x coordinate. The fitting results are shown in Table 4.

The relative growth of cumulative variables in the radial direction

To compare the relative rates of accumulation of different variables along the radius, we can plot two variables against each other on one coordinate system, as shown in Fig 6. The curves can also be predicted theoretically by combining the Gompertz functions to eliminate R. For example, considering the relationship between N(R) and T(R), we know $N(R) = \exp(-e^{-c_N (R - k_N)})$ and $T(R) = \exp(-e^{-c_T (R - k_T)})$; after eliminating R, we have $T = \exp(-e^{-c_T (k_N - k_T)} (-\ln N)^{c_T/c_N})$. From Fig 6, we can read the relative speed of accumulation for any pair of variables: attention flow and dissipation accumulate faster than the number of websites along the radius, and attention flow accumulates faster than dissipation. These curves resemble the Lorenz curve of income distribution, and are named Lorenz-like curves in this paper. If the speeds of the two variables are the same, the curve collapses onto the diagonal, and the bending degree of the curve reflects the difference in speed, which can be quantified by a Gini-like coefficient (G) defined as the difference A − B between the area A enclosed by the diagonal and the horizontal axis and the area B enclosed by the fitting curve and the horizontal axis. Therefore, if the fitting curve lies above the diagonal, the Gini-like coefficient is negative. The Gini-like coefficients are shown in Fig 6.
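Both the Gompertz fit and the Gini-like coefficient can be sketched as follows, on synthetic data; the Gompertz parameterization is one standard form consistent with the roles of $k_X$ and $c_X$ described above:

```python
import numpy as np
from scipy.optimize import curve_fit

def gompertz(R, k, c):
    # Normalized "S" curve: k offsets the curve along R and c sets the
    # slope of the fast-rising phase (one standard parameterization,
    # consistent with the roles of k_X and c_X in the text).
    return np.exp(-np.exp(-c * (R - k)))

def gini_like(x, y):
    # Gini-like coefficient A - B for a Lorenz-like curve y(x) on [0, 1]:
    # A = 1/2 is the area under the diagonal, B the (trapezoidal) area
    # under the curve; negative when the curve lies above the diagonal.
    B = np.sum((y[1:] + y[:-1]) / 2 * np.diff(x))
    return 0.5 - B

# Fit hypothetical noisy cumulative traffic data T(R).
R = np.linspace(0, 10, 50)
rng = np.random.default_rng(2)
T = gompertz(R, 3.5, 1.2) + 0.01 * rng.normal(size=R.size)
(k_T, c_T), _ = curve_fit(gompertz, R, T, p0=(5.0, 1.0))
```

On the synthetic data the fit recovers the generating parameters closely; a curve lying exactly on the diagonal yields a Gini-like coefficient of zero, and one above the diagonal yields a negative value.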
Therefore, according to the Lorenz-like curves, attention flow concentrates in the core layer, so it accumulates faster than dissipation. The number of websites accumulates along the radius very slowly compared with the other variables, because its density peak appears in the second layer. Thus, a very few popular websites dominate the attention resources; moreover, these websites are sticky enough that dissipation accumulates more slowly than attention flow.

The dynamics of the geometric representation
Next, we study the dynamics of the representation. Four snapshots at different times are selected such that the time spans between consecutive snapshots have similar lengths, as shown in Fig 7. It is interesting to observe that Google.com locates at the center of the geometric representation both on October 10, 2006 and on February 10, 2008, while Yahoo.com and Msn.com gradually move out of the central area of the map. This indicates that Google.com has out-competed Yahoo.com and Msn.com to become the dominant search engine. Moreover, after Youtube.com's establishment on April 23, 2005, it quickly attracted a large proportion of attention, moved into the central area of the geometric representation, and became an important website in the entire attention ecosystem.
Comparing the "S" curves of N(R) in different years, we find that they have shifted slightly over time for the part with R < 5, as shown in Fig 8. This indicates that the central area of the system becomes denser as time goes by, i.e., the websites become closer and more connected to each other. The distributions of attention flows, attention dissipations, and websites within different periods of a day are also discussed in the supporting information.
The parameter k of the Gompertz function controls the translation of the "S" curves, and c controls their growth rate. From Fig 9, we can see a downward trend of $k_N$ from October 10, 2006 to February 10, 2008, corresponding to the cumulative curve of the number of websites shifting to the left. It is also apparent that $c_N$ is larger than $c_T$ and $c_D$ in most cases, meaning that N(R) rises more sharply than T(R) and D(R) in general. This indicates that the distribution of websites along the radius is consistently more heterogeneous.
Furthermore, we consider the relative growth speeds of the cumulative variables along the radius in different snapshots, which can be shown by the dynamics of the Gini-like coefficients. From Fig 10, we read that the relative growth speeds observed in Fig 6 are almost preserved. However, the coefficient between T and N decreases continually, which can be accounted for by the leftward offset of the "S" curve of the website distribution.

Discussions and Conclusions
In this paper, we embed selected websites into a high dimensional space and study the distributions of collective attention flows and dissipations on websites using a biased collection of clickstream data. The geometric representation of the websites is based on a novel notion of flow distance, defined on an underlying open flow network model of attention flow that integrates the topological structure of hyperlinks and the collective behavior of user traffic between sites. We find that although social networks like Myspace.com and Facebook.com own most of the users' attention, the most central website is the search engine Google.com, where centrality is quantified by the average flow distance from the focal site to all other websites. We then focus on the collective distributions of websites, attention flows, and dissipations in the geometric representation space. We find that the geometric representation resembles a nearly symmetric ball, which can be divided into three layers along the radial direction according to the distributions of attention flows and websites. The innermost layer, the "core", attracts 75% of the attention flows and 55% of the dissipations with only 20% of the websites, the popular ones. The second layer, the "interim", encloses most of the normal websites (60%) but only 22% of the attention flows with 38% of the dissipations. The last layer, the "periphery", contains the remaining 20% of the websites with only 3% of the attention and 7% of the dissipations. Therefore, the distributions of attention flows, dissipations, and websites in the geometric representation are highly uneven, which can be well described by the "S"-shaped cumulative curves and the Lorenz-like curves of relative growth along the distance from the center.
Finally, we show the general trends of the dynamics of the representation by studying four snapshots of the geometric representations at different time points, and find that in general the observed patterns remain stable over time. Our research also has some drawbacks. First, we only use the surfing records of one university to represent the traffic on the World Wide Web; the data set is apparently biased and limited. However, we believe that our method and basic conclusions can be extended to larger data sets because this sample is representative. Second, the geometric representation needs the pairwise distances of all the websites, whose number grows as N², and computing the flow distances requires a matrix inversion. This makes the whole task very demanding when N is large, so approximate methods such as Monte Carlo simulation deserve further exploration.
Our work has some potential applications. First, the methodology can be applied to fields other than clickstream data; the geometric representation can at least provide a good visualization for open flow networks. Second, our work may give an alternative evaluation of websites that is clearly distinct from the PageRank method and from raw traffic data, since it reflects both the link structure of websites and the collective behaviors of users. Third, the flow distance between two websites also provides information about indirect interactions. This may outperform traditional analysis approaches that merely focus on directly connected websites, and help web masters place their advertisements in more appropriate locations.