Emergence of online communities: Empirical evidence and theory

Online communities, which have become an integral part of the day-to-day life of people and organizations, exhibit great diversity in both size and activity level; some communities grow to a massive scale and thrive, whereas others remain small and eventually wither. Despite the important role of these proliferating communities, there is limited empirical evidence identifying the dominant factors that underlie their dynamics. Using data collected from seven large online platforms, we observe a relationship between online community size and activity that generally repeats across platforms. First, in most platforms, three distinct activity regimes exist: one of low activity and two of high activity. Further, we find a sharp activity phase transition at a critical community size that marks the shift between the first and second regimes in six of the seven online platforms. Essentially, we argue that it is around this critical size that sustainable interactive communities emerge. The third activity regime occurs above a higher characteristic size, at which community activity reaches and remains at a constant, higher level. The steepness of the slope of the second regime, which leads to the third regime of saturation, varies across platforms, but the third regime itself is exhibited in six of the seven online platforms. We propose that the sharp activity phase transition and the regime structure stem from the branching property of online interactions.

b. The dependence of discussion tree growth on depth

We assume, in the model, that the growth of discussion trees, e.g., in equation (2), depends on tree depth. Essentially, this assumption takes into account that as a discussion tree grows longer and deeper, the probability of response to higher-depth messages decays. One way to model this is by allowing the responsiveness to depend on generation, g, i.e., Q = Q(g).
Several functional forms can represent this tendency of the response probability to decrease with tree depth, and none can be ruled out theoretically. With the MLE procedure, we tried several functional forms. Here, we report the two forms that provide the highest level of fit: the exponential decay form, Q(g) = C · e^(−λ·g), and the power-law decay form, Q(g) = C · g^(−λ). The best fit, by far, occurred when we used the power-law decay form. Therefore, we use the power-law form in the estimation reported in the main part of the paper. Interestingly, the MLE estimation of the power-law form exhibits a value of λ ≈ 1/2. In other words, our estimation shows that the functional form of the rate of deceleration of tree growth, f(g), closely approximates g^(−1/2).
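To illustrate this model-selection step, a minimal sketch follows. Both candidate decay forms are linear in log space (power law: log Q = log C − λ·log g; exponential: log Q = log C − λ·g), so ordinary least squares on log(Q) can compare them. The data here are synthetic and purely illustrative, generated from the power-law form the paper's MLE favored; this is not the paper's actual data or estimation code.

```python
import numpy as np

# Synthetic per-depth responsiveness following Q(g) = C * g**(-1/2)
# with small multiplicative noise (illustrative values only).
g = np.arange(1, 21)
rng = np.random.default_rng(1)
Q_obs = 0.3 * g ** -0.5 * np.exp(rng.normal(0, 0.05, g.size))


def fit_loglinear(x, y):
    """OLS fit y = a + b*x; returns (coefficients, sum of squared residuals)."""
    A = np.column_stack([np.ones_like(x, dtype=float), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return coef, float(resid @ resid)


# Power law: regress log(Q) on log(g); exponential: regress log(Q) on g.
coef_pow, sse_pow = fit_loglinear(np.log(g), np.log(Q_obs))
coef_exp, sse_exp = fit_loglinear(g.astype(float), np.log(Q_obs))

lam_pow = -coef_pow[1]  # slope is -lambda
print(f"power-law lambda = {lam_pow:.3f}, SSE power = {sse_pow:.4f}, SSE exp = {sse_exp:.4f}")
```

With data generated from the power-law form, the power-law regression recovers λ close to 1/2 and yields a lower residual sum of squares than the exponential form, mirroring the comparison described above.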

c. Maximum likelihood estimations and fit - description of the procedure
The purpose of the MLE process is to estimate the parameters of the offspring distribution function Φ(κ; N, Q(g)). Notably, the full dynamic picture of discussion tree growth is captured in the offspring distribution function. In our basic scenario, the distribution is homogeneous across tree nodes and depends solely on four parameters: N, N_max, q and g. These are the size of the community, the maximal size of community interaction, the constant community responsiveness and the depth of the node in a discussion tree, respectively. In our case, N is known per community and g is known per node; the observed random variable is the number of replies per node, κ. Under the assumptions of the model, the offspring distribution is taken to be binomial. Therefore, the likelihood function can be written as a product of binomial terms, B(κ_{i,j,k}; N(N_i, N_max), q · f(λ, g_{i,j,k})), where the indices i, j, k are for community, tree and node, respectively. The sample size of the binomial process is N(N_i, N_max) = min{N_i, N_max}, i.e., either community i's size, N_i, if N_i < N_max, or N_max otherwise. The community responsiveness decays with tree depth g. The functional form of this decay is expressed by f(g), as explained in the previous section. Here, we write the decay function as f(λ, g_{i,j,k}) to denote that the depth varies across trees and nodes. The maximum likelihood is obtained by the bounded Broyden-Fletcher-Goldfarb-Shanno (BFGS) approach. The solution is global, as we use random initial conditions across a wide range of parameters and runs. The results of the estimation are given in Table 2 in the main paper.
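A minimal sketch of this kind of estimation is shown below: a binomial log-likelihood with depth-dependent reply probability q · g^(−λ), maximized with SciPy's bounded L-BFGS-B. The data are synthetic with known parameters (q = 0.05, λ = 0.5, sample size n = 50), and depths start at 1 so that g^(−λ) is well-defined at the first generation; all names and values are illustrative assumptions, not the paper's actual code.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import binom

rng = np.random.default_rng(2)

# Synthetic per-node data: depth g >= 1 and observed reply count kappa,
# drawn from the binomial offspring model with known parameters.
n = 50                                   # min(N, N_max) for this community
g = rng.integers(1, 10, size=5000)       # node depths
kappa = rng.binomial(n, 0.05 * g ** -0.5)  # replies per node


def neg_log_lik(theta):
    """Negative log-likelihood of the binomial offspring model."""
    q, lam = theta
    p = np.clip(q * g ** (-lam), 1e-12, 1 - 1e-12)  # depth-decayed reply prob.
    return -binom.logpmf(kappa, n, p).sum()


res = minimize(neg_log_lik, x0=[0.1, 1.0], method="L-BFGS-B",
               bounds=[(1e-6, 1.0), (0.0, 5.0)])
q_hat, lam_hat = res.x
print(f"q_hat = {q_hat:.4f}, lambda_hat = {lam_hat:.3f}")
```

With 5,000 synthetic observations the optimizer recovers estimates close to the true q and λ; in practice one would restart from random initial conditions, as the paper describes, to guard against local optima.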
To achieve the fit shown in Fig 6, we used the following process. Using the MLE parameters, we simulated trees as a function of community size. For each community, we simulated 1000 trees, and for each community size we created 1000 simulated communities. The chosen numbers of trees and community sizes did not have a qualitative effect on the results. Using the MLE and model-based simulated communities, we calculated the median per-capita activity as a function of size. The result is shown in Fig 6.

d. Maximum likelihood estimations - Robustness

In the binomial scenario, the number of replies is taken to be the result of N people, each supposedly choosing at random whether to reply or not, with some probability q. The Poisson process, on the other hand, assumes that there is a set mean of replies and a random process in a fixed time window. Last, in the Negative Binomial scenario a different choice of process is made: the Negative Binomial distribution allows a more dispersed distribution of offspring. The results of the two latter forms are presented in C and D Tables. We use the same sample from the main draft for these estimations, containing 20,000 observations. Both C and D Tables show that the evaluation of the basic probability of reply is more-or-less robust. In D Table, the base distribution is Negative Binomial, and thus the interpretation of the probability should be one minus the estimated q. The events
(which in our data are replies) are considered 'failures' in the language of the Negative Binomial distribution. The decay of offspring rate with tree depth is captured in the Poisson scenario, but not as much in the Negative Binomial scenario. This suggests that the Negative Binomial may not be a good choice of base distribution. The overall impression is that both distributions capture somewhat more variance, in the sense that they lower the AIC in model 4. We recognize that the real-life base probability distribution may be more complex than a simple binomial distribution, and we hope to explore and address this in future research.

E Table shows the estimated parameters for the rest of the datasets. Notably, unlike the TAP dataset, we do not observe the structure of the trees. This is because these platforms do not explicitly render a user's answer to a specific message; rather, they generally allow a user to "chain" their message in a thread of messages. One exception is YouTube, which allows a response, but only to a depth of one level within the tree. With YouTube, there is no structure beyond the second level of the tree. Users have learned to use tags in the response body in order to denote who they refer to. However, we cannot be sure that the references are complete. Therefore, we used a different MLE process, where we fit the model to the data on the group level rather than on the individual message level. For each set of parameters (q, N_max and probability of tree appearance, p), we simulated an ensemble of trees to generate simulated groups. Given the lower resolution of these data, we cannot identify λ. We then fit, using MLE, the median of the data to the median of the simulated groups, conditional on the parameters. The results are given in E Table and visualized in C to H Figs. The goodness of fit was measured using a Welch t-test between the benchmark model's and the main model's distributions of Mean Square Errors.
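The tree-ensemble simulation at the heart of this group-level procedure can be sketched as follows: a branching process in which each node at depth g receives Binomial(min(N, N_max), q · f(g)) replies. The decay offset f(g) = (g + 1)^(−λ), the parameter values, and the depth cap are illustrative assumptions made so the sketch is self-contained and the root (g = 0) is well-defined; they are not the paper's exact specification.

```python
import numpy as np


def simulate_tree(N, N_max, q, lam, rng, max_depth=50):
    """Simulate one discussion tree under a binomial branching model.

    Each node at depth g draws Binomial(n, q * (g + 1)**(-lam)) replies,
    where n = min(N, N_max). Returns the total tree size (root included).
    The max_depth cap is a safety guard for the sketch.
    """
    n = min(N, N_max)
    size = 0
    frontier = [0]  # depths of nodes awaiting replies; root at depth 0
    while frontier:
        g = frontier.pop()
        size += 1
        if g >= max_depth:
            continue
        p = min(1.0, q * (g + 1) ** (-lam))  # depth-decayed responsiveness
        k = rng.binomial(n, p)               # offspring of this node
        frontier.extend([g + 1] * k)
    return size


# Ensemble of simulated trees for one hypothetical community.
rng = np.random.default_rng(0)
sizes = [simulate_tree(N=40, N_max=100, q=0.02, lam=0.5, rng=rng)
         for _ in range(1000)]
print(f"median tree size: {np.median(sizes)}")
```

Summary statistics of such ensembles (e.g., medians across simulated groups) are what the group-level MLE matches against the observed data, conditional on the parameters.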
With the exclusion of the RED and WIKI datasets, the model fit looks adequate and seems to provide a significantly better fit than the benchmark model (which, in this case, we took to be just the intercept, q). Although we cannot explain the difference between these two datasets (RED and WIKI) and the rest, we note that even though the exact pattern is not universal, the model seems to be useful across platforms. I and J Figs illustrate the activity-size curves for two more platforms across communities' life-cycle stages. These are similar to the life-cycle stage curves shown for the TAP dataset in the main paper (Fig 2), but for other platforms. Here, again, communities were divided into groups of different age (i.e., time since initiation) and, per each group, the activity-size curve was calculated and plotted. Similar to the TAP case, we observe that the three-regime pattern exists throughout the communities' lifetime. Unlike the TAP case, it is hard to interpret the differences of the activity-size curves between age tiers. We note in passing that since the transition in the BRDS case (J Fig) occurs at relatively large community sizes (> 50), the samples for the curve calculation at the large-size end are small. The implications are: (1) instead of four age categories we have three in J Fig, and (2) the sample sizes for very large communities (around > 300) were insufficient, and so the curve ends at lower sizes compared to Fig 5(a).

[23] The definitions we cite consider communities to be separate and distinct groups of people. Our data, however, are secondary and extracted from the online world, and therefore we have to make a simplifying assumption. Throughout the paper, we assume that when a group of people congregates around a common theme (e.g., within a topical discussion forum or around a piece of online content), they form for this purpose an ad-hoc community that is separate from other communities on the platform. We ignore, in that sense, the fact that within and across platforms, the membership of communities may overlap. We hope that future research will tackle this distinction and explore the consequences of making this assumption.