Collective Phenomena and Non-Finite State Computation in a Human Social System

We investigate the computational structure of a paradigmatic example of distributed social interaction: that of the open-source Wikipedia community. We examine the statistical properties of its cooperative behavior, and perform model selection to determine whether this aspect of the system can be described by a finite-state process, or whether reference to an effectively unbounded resource allows for a more parsimonious description. We find strong evidence, in a majority of the most-edited pages, in favor of a collective-state model, where the probability of a “revert” action declines as the square root of the number of non-revert actions seen since the last revert. We provide evidence that the emergence of this social counter is driven by collective interaction effects, rather than properties of individual users.

where w_i is the ith symbol in word w. We have, further,

A_ij(w) ≤ (A^{|w|})_ij,    (1)

or, in words: the probability to go from state i to state j and emit the word w is less than or equal to that of simply going from i to j in the same number of steps. By the Perron-Frobenius theorem, the inequality of Eq. 1 implies that all eigenvalues, β_i, of A_ij(w) lie within the unit circle (|β_i| ≤ 1 for all i), with equality obtaining only in the case that A_ij(w) is identical to (A^{|w|})_ij. We neglect this latter, trivial case, which obtains only when w is shift-invariant and all observation runs are given by repeated instances of w. Conversely, the possibility condition amounts to the condition that the matrix A_ij(w) is not nilpotent, i.e., that there exists a non-zero eigenvalue.
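These spectral claims are easy to check numerically. The sketch below is a hypothetical five-state chain over the symbols {C, R}; the per-symbol split of the transition matrix and the word CCR are assumptions of the illustration. It builds A_ij(w), verifies the elementwise inequality of Eq. 1, and confirms that the spectral radius lies inside the unit circle:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.dirichlet(np.ones(n), size=n)       # row-stochastic transition matrix A_ij
split = rng.random((n, n))                  # per-edge chance of emitting C vs. R
A_C, A_R = A * split, A * (1 - split)       # sub-stochastic per-symbol matrices
w = "CCR"                                   # a hypothetical word over {C, R}
A_w = np.linalg.multi_dot([{"C": A_C, "R": A_R}[s] for s in w])
# Eq. 1: A_ij(w) <= (A^{|w|})_ij, elementwise
assert np.all(A_w <= np.linalg.matrix_power(A, len(w)) + 1e-12)
rho = max(abs(np.linalg.eigvals(A_w)))      # all eigenvalues inside the unit circle
assert rho < 1.0
```

Because the strictly positive A makes A^{|w|} irreducible and A_ij(w) strictly smaller elementwise, the inequality here is strict and the spectral radius is strictly below one.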
If the system (or our knowledge of it) is distributed over its internal states according to the probability vector π_i, we can write the probability of observing a repeated string w^k as

P(w^k) = Σ_{i,j} π_i [A(w)^k]_ij.

While we have assumed for simplicity that A_ij is irreducible, this will not usually be the case for A_ij(w). This latter matrix will in general contain both essential and inessential "self-communicating" classes¹ along with a set of nuisance indices that connect to no other class (i.e., i for which A_ij(w) is equal to zero for all j) [3]. The structure of A_ij(w) may be visualized as a directed acyclic graph over these classes. Inessential classes may have non-zero out-degree, while essential classes, and nuisance indices, are the terminal nodes. Self-loops are permitted, and exist for both inessential and essential classes; these will be crucial to our argument below.

¹ An index i leads to an index j (written i → j) iff there exists a k such that A^k_ij(w) > 0. Indices i and j communicate if i → j and j → i. Communication is an equivalence relation, so that classes can be built that contain indices that communicate with each other. Essential classes (sometimes called "final" classes [2]) are those which do not lead to any index outside the class; inessential classes are those which may.
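The class decomposition can be made concrete with a small example. In this sketch (the three-state matrix is hypothetical), reachability is computed by a boolean transitive closure, and each communication class is flagged essential or inessential according to whether any member leads outside it:

```python
import numpy as np

# Hypothetical A_ij(w): state 0 has a self-loop and leads out; state 1 leads
# only onward (a nuisance-like index); state 2 is absorbing.
A_w = np.array([[0.3, 0.2, 0.0],
                [0.0, 0.0, 0.4],
                [0.0, 0.0, 0.5]])
n = A_w.shape[0]
reach = A_w > 0
for m in range(n):                          # Warshall transitive closure
    reach |= reach[:, [m]] & reach[[m], :]
# Communication class of i: i itself plus every j with i -> j and j -> i.
cls = [frozenset({i} | {j for j in range(n) if reach[i, j] and reach[j, i]})
       for i in range(n)]
# A class is essential iff no member leads to an index outside the class.
essential = {c: all(reach[i, j] <= (j in c) for i in c for j in range(n))
             for c in set(cls)}
assert essential[cls[2]] and not essential[cls[0]]   # {2} essential, {0} not
```

Here {0} is inessential (it leads to 1 and 2), and {2} is essential and terminal, matching the directed-acyclic-graph picture in the text.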
Because the initial distribution π may have zero entries, we consider only the part of A_ij(w) corresponding to descendants of the non-zero part of π in the associated directed acyclic graph. Transitions among the set of nuisance indices, by definition, cannot repeat an index. Thus their structure is not relevant to the asymptotic behavior of P(w^k), and we may focus on the essential and inessential classes.
We are particularly interested in the classes that will dominate the probability P(w^k) as k becomes large. Consider the restriction of A_ij(w) to a particular class α: i.e., construct a submatrix from A_ij(w) using only i, j ∈ α. Call this restriction α_ij(w). Consider, similarly, the restriction of the distribution π to this class.
Assume first that α_ij(w) is diagonalizable. Then the probability of producing k copies of w, while remaining in the class α, is

P(w^k | α) = Σ_q π^(q) β_q^k,    (4)

where β_q is the qth eigenvalue of α(w) and π^(q) is the weight of the restricted initial distribution in the qth eigendirection. By construction of the equivalence classes, α is irreducible. Then, by the Perron-Frobenius theorem, the largest eigenvalue of this matrix, β_1, is real, has a strictly positive eigenvector, and π^(1) is necessarily greater than zero. If α_ij(w) is aperiodic, then P(w^k | α) can be written

P(w^k | α) = A_1 β_1^k + Σ_{i>1} A_i β_i^k,    (5)

where A_1 > 0, β_1 is real, and |β_i| < β_1 for all i > 1, and

exp lim_{k→∞} (1/k) log P(w^k | α) = β_1.    (6)

If α_ij(w) is diagonalizable, but the period, d, is greater than one, we will have additional eigenvalues associated with complex rotations of β_1, namely β_1 exp(2πik/d), k = {1 . . . d − 1}. These lead to additional oscillatory terms at leading order; the oscillations are governed by an overall exponentially-decaying envelope, so that

exp lim sup_{k→∞} (1/k) log P(w^k | α) = β_1,    (7)

regardless of the period of α_ij(w). Finally, consider the case of non-diagonalizable α_ij(w). In this case, the matrix can be brought into Jordan normal form, with m blocks, each of size n_i and associated with an eigenvalue β_i. Assume that the matrix is aperiodic. By the Perron-Frobenius theorem, n_1 is equal to one [4]. The kth power of α_ij(w) can then be expanded in powers of its Jordan blocks (see, e.g., Ref. [5]), where, as before, A_1 > 0, β_1 is real, and |β_i| < β_1 for all i > 1. When k is greater than the largest block size, we can write

P(w^k | α) = Σ_i f_i(k) β_i^k,    (8)

where f_i(k) is a polynomial function of k, of degree n_i − 1. Eq. 8 thus obeys Eq. 6; for a non-aperiodic α, an argument identical to the above gives the convergence of Eq. 7.
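A quick numerical check of the single-class behavior, using a hypothetical irreducible sub-stochastic matrix as a stand-in for α_ij(w): the ratio of successive probabilities converges to the Perron-Frobenius eigenvalue β_1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
# Strictly positive entries => irreducible; row sums below one => sub-stochastic.
alpha = 0.2 * rng.random((n, n)) + 0.01
pi = np.ones(n) / n                          # initial distribution on the class
beta1 = max(abs(np.linalg.eigvals(alpha)))   # Perron-Frobenius eigenvalue
P = lambda k: pi @ np.linalg.matrix_power(alpha, k) @ np.ones(n)   # P(w^k | alpha)
ratio = P(61) / P(60)                        # successive-probability ratio
assert abs(ratio - beta1) < 1e-6             # converges to beta1
```

The subdominant eigenvalues decay away geometrically, so by k of a few dozen the ratio agrees with β_1 to many digits.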
Having understood the single-class case, we now consider w^k strings generated by multiple classes. Any particular string w^k may be generated by a set of transitions within and between classes. Because between-class transitions are governed by the directed acyclic graph structure, there can be only a finite number of them. Thus, as k becomes large, the probability P(w^k) for a particular set of transitions will be governed by the self-transitions, given by terms of the form Eq. 7.
In particular, P(w^k) is the sum of a finite number of terms; each term in the sum is a product of at most p transitions between classes, and at least k − p terms of the form P(w^n | α), for different α. Explicitly,

P(w^k) = Σ_{i ∈ p(G)} T_i Π_α P(w^{n_{i,α}} | α),    (10)

where i indexes the paths of length k through the graph G representing the underlying A_ij(w) structure, T_i is a prefactor governing the probabilities of transitions between classes, the product runs over the N classes, and the total number of within-class transitions,

Σ_α n_{i,α} ≥ k − p,    (11)

is forced to grow with k, for all possible paths i. For large k, the growth in the number of possible paths (i.e., the growth of |p(G)|) is bounded by the growth in the number of ways to partition the sum in Eq. 11. In particular, for large k, the number of possible paths relevant to P(w^k) can increase only polynomially in k.² Meanwhile, each term in the sum of Eq. 10 is decreasing exponentially, governed by products of the β_{α,1}, the largest eigenvalues for the classes that have self-transitions for that term. The dominant terms in the sum will be those for which the exponential decline is slowest. By the Perron-Frobenius theorem, the largest eigenvalue found among the self-communicating classes of A_ij(w) is equal to the spectral radius of the matrix as a whole. If P(w^k) is greater than zero for k larger than p, the pigeonhole principle invoked in the ordinary pumping lemma [6] allows us to assume the existence of at least one self-communicating class; the slowest exponential decline is then set by the spectral radius of A_ij(w) itself.
exp lim sup_{k→∞} (1/k) log P(w^k) = ρ(A_ij(w)),    (12)

which was to be proved. While our paper presents the first explicit application of this form of reasoning to human social systems, we note in passing the use of this kind of reasoning in the study of bird song. Once regarded as strictly finite-state [7], the sound sequences produced by songbirds are now recognized to show features of non-finite-state computation. A recent, compact model of song production in the Bengalese finch (Lonchura striata domestica) [8] demonstrates the need for a self-modifying (and thus non-finite-state) Markov process.
An analysis of data on a different species, the Zebra finch (Taeniopygia guttata), shows that the probability of an additional repetition, the analog of this paper's P(C^{k+1})/P(C^k), decreases exponentially [9]. This is, of course, the other way to violate the probabilistic pumping lemma (under the assumption of having reached an aperiodic final class): the exponential of the lim-sup, Eq. 7, goes to zero as opposed to unity. It is just as much evidence against finite-state computation, but found in the anomalous absence, rather than presence, of extreme events.

Figure 1. Numerical study of the convergence of repeated-word frequencies to exponential decay, with cutoff predicted by the spectral radius. Shown here is the convergence of the measured decay rate to the asymptotic limit predicted by Eq. 12, for irreducible finite-state processes with ten states, two output symbols {C, R}, w equal to C, and a uniform distribution over values of ρ(A_ij(w)), the spectral radius and asymptotic decay rate, with 0 < ρ < 1. Light blue shows 2σ, and dark blue 1σ, ranges about the median value. For empirical work, convergence is much faster when considering [P(w^{q+k})/P(w^q)]^{1/k}, with q larger than the (assumed) number of states.

Appendix S2: Numerical Tests of Convergence Properties
With a view towards determining how the lemma of the previous section applies to actual finite-state processes, we study a restricted class of machines numerically. We sample from the space of probabilistic unifilar machines with p states over a two-symbol alphabet. Such a system can be represented by a weighted, directed graph, with each node having at least one, and at most two outgoing edges, each of which is associated with one of the two symbols, and whose weights sum to unity.
For small p, the underlying graph-theoretic space can be described completely: for each node, we have a choice of one vs. two outgoing edges; in the case of only one outgoing edge, we must choose between the two symbols. Neglecting the possibility of equivalent machines, we then have the number of such machines, as a function of p, as

N(p) = (p^2 + 2p)^p,    (13)

which grows rapidly: there are over 12 billion such machines with six states, and more than 10^400 with one hundred states.
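The count can be verified directly; this sketch simply recomputes the combinatorics (for each node, p^2 two-edge configurations plus 2p one-edge configurations, over p nodes):

```python
# Counting unifilar machines with p states over a two-symbol alphabet:
# each node either has two outgoing edges (one per symbol, p targets each:
# p^2 options) or a single edge (2 symbols x p targets: 2p options).
def n_machines(p: int) -> int:
    return (p * p + 2 * p) ** p

assert n_machines(6) == 12_230_590_464       # just over 12 billion at p = 6
assert len(str(n_machines(100))) > 400       # more than 10^400 at p = 100
```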
We are most interested in how quickly the statistics of an actual machine approach the limiting value given by Eq. 12. For any particular A_ij(w), we can compute the spectral radius and compare it to the ratio P(w^k)/P(w^{k−1}), as a function of k, for distributions over initial conditions that include a self-communicating class.
In Fig. 1 we show convergence to the limit by sampling the space of strongly-connected ten-state machines, and considering the frequency of a single repeated symbol. We take a uniform prior over ρ(A_ij(w)), the spectral radius and limit established by the lemma of the previous section, and show the convergence ratio,

C(k) = [P(w^k)]^{1/k},    (14)

to provide a numerical example of the limiting process established in the previous section. For small k, P(w^k) may be dominated by movement through nuisance states and inessential classes, and by contributions from essential classes that have small self-communication probability. Convergence to the spectral radius thus occurs much faster when considering

Ĉ(q, k) = [P(w^{q+k})/P(w^q)]^{1/k},    (15)

where q is longer than the relevant scales of the transient phenomena (e.g., at least as large as the assumed number of states). This is shown explicitly in Fig. 2, where we take q to be the number of states in the system.

Figure 2. Convergence to the exponential cutoff as seen with Ĉ(q, k) (Eq. 15), for the same system as in Fig. 1. Here we take q equal to ten, the number of states. For the same amount of data, convergence is faster for Ĉ than for C; here convergence of Ĉ to the asymptotic value (at 1σ confidence) is achieved by k equal to thirty.
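The advantage of the estimator of Eq. 15 can be seen in a two-state toy model with one inessential state feeding an essential one. A minimal sketch (the matrix is hypothetical; its spectral radius is 0.8 by construction):

```python
import numpy as np

# Hypothetical A_ij(C): state 0 is inessential and leaks into state 1,
# the essential class; the spectral radius rho = 0.8 sets the asymptotic decay.
A_C = np.array([[0.5, 0.4],
                [0.0, 0.8]])
pi = np.array([1.0, 0.0])                      # start in the inessential state
P = lambda k: pi @ np.linalg.matrix_power(A_C, k) @ np.ones(2)   # P(C^k)
rho = 0.8
q, k = 10, 10
C_plain = P(q + k) ** (1.0 / (q + k))          # naive estimate P(w^k)^(1/k)
C_hat = (P(q + k) / P(q)) ** (1.0 / k)         # transient-corrected, Eq. 15
assert abs(C_hat - rho) < abs(C_plain - rho)   # hat-C is much closer to rho
```

Dividing out P(w^q) cancels the prefactor contributed by the transient, which is why the hatted estimator settles onto the spectral radius far sooner.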

Appendix S3: Details on Coarse-Graining and Analysis of Wikipedia Behavior
Our coarse-graining of behaviors on any particular page aims at locating where one user reverts (undoes) the contributions of another editor completely. We locate reversion edits in two distinct ways. Firstly, we can rely on edits whose comments explicitly flag them as reverts. Secondly, following analyses such as those of Ref. [11], we can look for versions of a page with identical SHA1 checksums; the version with the later timestamp may then be considered a revert to the earlier page. In general, these two methods align very well, although not perfectly; in this work, we focus on the latter as the more objective method, since it does not rely on editors' self-reporting. We do not include self-reverts, or edits that do not alter any aspect of the page (i.e., that would otherwise look like "reverts to the current version").

The probabilistic pumping lemma works in terms of P(w^k), and our analysis considers the probability of repeated cooperation. However, the measurement of P(C^k) in the data, if done naively, leads to unacceptable results. In particular, estimating P(C^k) for a particular page by counting the number of times the string C^k appears in the time-series leads to strong bin-to-bin correlations, since an observation of a string C^k necessarily leads to observations of strings of the form C^{k−1}, C^{k−2}, . . . , C^{k−⌊k/2⌋+1}, and then two observations of the form C^{k−⌊k/2⌋}, and so on. This would lead to excessive complications in the likelihood analysis; conversely, if the correlations are neglected, it leads to claims of heavy-tailed distributions that spuriously rule out exponential decay.
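The checksum-based coarse-graining can be sketched as follows. The page history here is hypothetical, and the sketch codes a version as "R" whenever its SHA1 duplicates an earlier version; the full analysis additionally excludes self-reverts and null edits, which this minimal version does not attempt.

```python
import hashlib

# Hypothetical page history: the third version restores the first exactly.
history = ["Paris is the capital of France.",
           "Paris is the capitol of France!!",
           "Paris is the capital of France."]

def code_history(versions):
    """Code each version as C (cooperate) or R (revert to an earlier state)."""
    seen, symbols = set(), []
    for text in versions:
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        symbols.append("R" if digest in seen else "C")
        seen.add(digest)
    return "".join(symbols)

assert code_history(history) == "CCR"
```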
Instead, we count prefix- and suffix-free strings that do not have this shift problem: in particular, we consider the quantity N(RC^kR). As long as N(RC^kR) is significantly less than N, counts of RC^kR and RC^mR are independent of each other, and the expected value of N(RC^kR) is simply N times P(RC^kR). The quantity P(RC^kR) can in turn be written in terms of the conditional probabilities P(C^k|R). In the case that P(C^k|R) is a sum of exponentials in k, N(RC^kR) is as well; in words, if P(C^k) is a sum of exponentials, so is N(RC^kR). The relationship between these two quantities is not always so simple, however: in the collective state (CS) case, Eq. 16 implies that the quantity N(RC^kR) has a different functional form from P(C^k). The resulting CS form for N(RC^kR) is the functional form we fit and display in Fig. 1 of the Main Article.
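Counting the revert-bounded runs N(RC^kR) in a coded edit sequence requires a lookahead pattern, so that a single R can close one run and open the next; the sequence below is hypothetical:

```python
import re
from collections import Counter

# Count prefix- and suffix-free strings RC^kR; adjacent runs share their
# bounding R, so a zero-width lookahead is used to allow overlapping matches.
def revert_bounded_runs(sequence):
    return Counter(len(m.group(1))
                   for m in re.finditer(r"(?=R(C+)R)", sequence))

counts = revert_bounded_runs("RCCCRCRRCCR")
assert counts == Counter({3: 1, 1: 1, 2: 1})   # one run each of k = 3, 1, 2
```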

Appendix S4: Details on Model Selection
In this section we describe in greater detail our methods for distinguishing between the asymptotic and exponential models. Computation of the likelihood ratio requires an error model for the distributions of N(RC^kR). Since we lack an explicit model for the errors themselves, as a first approximation we take measurements of N(RC^kR) to be independently and identically distributed. For N(RC^kR) ≪ N, with N the total number of observations, this is a reasonable assumption. Given this, a Poisson distribution of counts follows, and L, the log-likelihood log P(D|w, M) for any particular model M with parameters w, can be written as

L = Σ_k [ N(RC^kR) log λ_k(w) − λ_k(w) ],

where λ_k(w) is the model prediction for the expected count at run length k, and where we drop model-independent constants. Given sufficiently flat priors, P(w|M), around the peak of this function, this is sufficient to estimate many quantities of interest, including the maximum a posteriori values of w and the error bars on those estimates.
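A sketch of this likelihood computation, with a single-exponential model for the expected counts; the counts and parameter values below are invented for illustration:

```python
import numpy as np

# Poisson log-likelihood of observed counts N(RC^kR) given model predictions,
# dropping the model-independent log(N_k!) term.
def log_likelihood(counts, predicted):
    counts, predicted = np.asarray(counts, float), np.asarray(predicted, float)
    return float(np.sum(counts * np.log(predicted) - predicted))

k = np.arange(1, 6)
counts = np.array([80, 41, 19, 11, 5])         # hypothetical N(RC^kR) data
model = lambda A, p: A * p ** k                # a 1EXP-style prediction
# The decay rate that generated the data (p = 0.5) beats a poor one (p = 0.9):
assert log_likelihood(counts, model(160.0, 0.5)) > log_likelihood(counts, model(160.0, 0.9))
```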
Our main goal, however, is not parameter estimation, but rather model selection, where one compares models with different parameter spaces. In our particular case, one class of models (nEXP) can approximate, by superposition of exponentials, the other class (CS). As the number of exponentials in the sum increases, the approximation becomes increasingly good. We would like to know when we are justified in preferring the more parsimonious model.
Two main frameworks exist for the resolution of this question. On the one hand, the Akaike Information Criterion (AIC) can be used to estimate the expected KL divergence between the predictions of a model and the true process. In the limit of large amounts of data, it prescribes a constant penalty of k, the number of parameters, applied to the log-likelihood.
This penalty is sometimes interpreted as an "Occam penalty," but it is correctly understood as a guide for prediction out of sample. Prediction out of sample is a conceptually distinct problem, since a complicated approximation to the true model may work very well in a limited range, particularly in the presence of experimental noise. In Monte Carlo testing, AIC tends to prefer complicated approximations, even in cases where the underlying model is more parsimonious [12]; a related formal result is that AIC is "dimensionally inconsistent," meaning that even in the limit of infinite data, use of the AIC leads to a non-zero probability of choosing an (incorrect) approximation [13].
On the other hand, one can compute (or approximate) what is called the Evidence, which requires knowledge of both the likelihood, P(D|w, M), and the prior expectation of parameter ranges, P(w|M):

E = ∫ d^k w P(D|w, M) P(w|M),

where k is the number of parameters (the dimensionality of w). Formally, the Evidence is proportional to "the probability of the model M, given the data observed," if equal prior probability is given to the models under consideration. As in all model-selection settings, absolute values of the Evidence are irrelevant. One considers only ratios, and phrases the question, as in Table 1, as whether (for example) "model A is at least a factor of 10^3 more likely than model B." In this work, we take the latter approach, operating entirely within the Bayesian framework. This is because our contrasting model classes have small numbers (fewer than ten) of parameters, all of which have clearly specifiable priors, P(w|M). Computation of the full posterior is now common when these circumstances obtain, as is often the case in the exact sciences [14][15][16].
In order to calculate E, we use the Laplace (or saddle-point) approximation; in log-units,

E ≈ L(w_max) + log P(w_max|M) + (k/2) log 2π − (1/2) log det A,

where L is the log-likelihood, w_max are the parameters that maximize the likelihood, and A is the Hessian of the negative log-likelihood at the maximum, equal to

A_ij = − ∂²L / ∂w_i ∂w_j |_{w = w_max}.

We refer the reader to Ref.
[17] for details on this approximation. It remains to specify the priors P(w|M) for the two models. The nEXP class has 2n parameters; the CS class has 3. The parameters are of two kinds.
Both nEXP and CS have parameters corresponding to the one-step decay of the underlying quantity P(C^k). In the case of nEXP, there are n such parameters, b_i, that play this role. In the case of CS, there is only one, p. We take a uniform prior in p (CS) and in the b_i (nEXP). We allow each to range independently between zero and 0.995; the high end corresponds to an exponential cutoff of order 200 repeats, much longer than is seen in the data.
We then have the normalizations of the terms (n normalizations for nEXP, one for CS). These are fixed by the value of P(C^1), the overall cooperative fraction.
The maximum value of P(C) is unity. This leads to an overall area factor of 1/n! for the nEXP normalizations, where the factor of n! arises because the sum of all normalizations is confined to the interior of an n-dimensional simplex. In the case of CS, P(C^1) is equal to A(1 − p). We thus have to integrate over the range of p values to find the area associated with the CS normalization prior,

∫_0^{0.995} dp / (1 − p) = log 200 ≈ 5.3.

Finally, CS has a third parameter, α. For each value of 1 − p, we allow this to range between zero (pure exponential) and α(p), where α(p) is set to give a 1/e cutoff at 200 repeats. As an example, α(0.995) is zero; if α were greater than zero, the overall function would have an exponential cutoff longer than 200 repeats. Given these, the area factor for the nEXP decay parameters is 0.995^n, and for CS it is ∫_0^{0.995} α(p) dp ≈ 1.28841.
Putting together all these area factors, we can then pre-compute −log A, equal to log P(w_max|M), a constant independent of w. For the George W. Bush article, for example, we have −log A equal to −12.6 for the CS case, and −10.3 (1EXP), −18.7 (2EXP), −27.4 (3EXP). Note that prior areas are not directly comparable between different models; a "change of units" (e.g., working in terms of P(RC^kR) vs. N(RC^kR)) will scale A. This scaling, however, is directly compensated for by the Hessian determinant term.
Together with the maximum log-likelihood, the determinant of the Hessian, and the (k/2) log 2π term, these are sufficient to compute the (Gaussian approximation to the) relative log-Evidence for the two model classes, ∆E, reported in Table 1 (Table 2 in the Main Paper). In general, the highest-evidence member of the nEXP class is either 3EXP or 4EXP.
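The Laplace computation of the log-Evidence can be sketched in one dimension, where a Gaussian-shaped log-likelihood makes the approximation exact and therefore checkable. All numbers here are hypothetical; only the 0.995 prior range is taken from the text.

```python
import numpy as np

# Laplace approximation to the log-Evidence:
#   E ~= L(w_max) + log P(w_max|M) + (d/2) log 2*pi - (1/2) log det A,
# with A the Hessian of the negative log-likelihood at the maximum.
def laplace_log_evidence(L_max, log_prior, hessian):
    hessian = np.atleast_2d(hessian)
    d = hessian.shape[0]
    _, logdet = np.linalg.slogdet(hessian)
    return L_max + log_prior + 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet

s = 0.1                           # hypothetical posterior width in parameter w
prior_area = 0.995                # flat prior over w in [0, 0.995]
# Gaussian log-likelihood peaking at L = 0 with curvature 1/s^2:
E = laplace_log_evidence(0.0, np.log(1 / prior_area), np.array([[1 / s**2]]))
exact = np.log(np.sqrt(2 * np.pi) * s / prior_area)   # exact Gaussian integral
assert abs(E - exact) < 1e-12
```

For the actual nEXP vs. CS comparison, the same routine would be fed the fitted maximum log-likelihood, the pre-computed −log A prior term, and the numerically evaluated Hessian.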