Evidence of Online Performance Deterioration in User Sessions on Reddit

This article presents evidence of performance deterioration in online user sessions quantified by studying a massive dataset containing over 55 million comments posted on Reddit in April 2015. After segmenting the sessions (i.e., periods of activity without a prolonged break) depending on their intensity (i.e., how many posts users produced during sessions), we observe a general decrease in the quality of comments produced by users over the course of sessions. We propose mixed-effects models that capture the impact of session intensity on comments, including their length, quality, and the responses they generate from the community. Our findings suggest performance deterioration: Sessions of increasing intensity are associated with the production of shorter, progressively less complex comments, which receive declining quality scores (as rated by other users), and are less and less engaging (i.e., they attract fewer responses). Our contribution evokes a connection between cognitive and attention dynamics and the usage of online social peer production platforms, specifically the effects of deterioration of user performance.


Introduction
Performance deterioration following a period of sustained mental effort has been documented in settings that include student performance [1], driving [2], data entry [3], and exerting self-control [4]. Although the mechanisms for deteriorating performance are still debated [5,6,7], deterioration has been shown to be accompanied by physiological brain changes [8,9,10], suggesting a cognitive origin, whether due to mental fatigue, boredom, or strategic choices to limit attention. Outside of vigilance tasks, however, relatively little is known about whether and how this phenomenon affects online behavior. As our society  becomes increasingly interconnected and people spend more time interacting through various online platforms, analyzing online performance is important for understanding how content is produced and consumed [11], how information spreads [12,13,14], and how people decide what and who to pay attention to [15,16]. In this work, situated under the broad umbrella of user behavior modeling [17], we study online performance on Reddit, a popular peer production and social news platform. We measure online peer production performance as the quality of comments produced by Reddit users over the course of a session, defined as a period of activity without a prolonged break. The dataset we study contains over 55 million comments posted on Reddit in April 2015, and includes a variety of related meta-data, such as time stamps, information about the users, and the score attributed by others to each comment. We segment user activity into sessions, defined as periods of commenting without a break longer than 60 minutes, as suggested in [18] (cf. Figure 2). We link an individual's commenting performance over the course of a session to different proxy measures for a comment's quality, such as its length, readability, the score it receives from others, and the number of responses it triggers.
Our analyses uncover deteriorating online performance over the course of user sessions, with a decline in quality of subsequent comments across different proxy measures. Figure 1 illustrates the decline in the average score received by comments posted during sessions with ten comments: the data shows that each subsequent comment receives a rating that is on average 0.3 points lower than the preceding one. The size of this effect is quite large: It is equivalent to Figure 2: Sessions and randomization. Circles represent comments C i and arrows depict the time difference ∆t i,j between subsequent comments C i and C j . Sessions are derived by breaking at time differences exceeding 60 min. Original data sessions are shown in the first row. The middle row shows randomized sessions where time differences between comments are swapped for deriving new sessions while retaining the original order of comments. The bottom row depicts the randomized index data where sessions are retained but the order of comments within sessions is swapped. a 30% probability increase of receiving a downvote to a comment, for each extra comment posted after the first one in the session. To statistically study this effect, we design and implement mixed-effects models-allowing the incorporation of heterogeneous behavioral differences-which model the effect of session duration on the deterioration of online performance.
Our findings may be linked to effects of cognitive depletion: Exerting mental effort to compose a comment may diminish an individual's capacity to continue producing quality comments, whether through the loss of attention, mental fatigue, or simply the onset of boredom. Evidence also suggests that people, and other primates, have finite cognitive capacity for managing interpersonal relationships [19] limiting their amount of social interaction [20,21]. Only recently, our research community started investigating the possible relationship between cognitive limits and online interactions, showing the impact of information overload on user behavior [22,20,23,24]. Possibly, within-session deterioration of performance could explain the difficulty for users to continue exerting effort to discover information deeper in their social stream [16,15,25].
Although unveiling the mechanism(s) behind observed phenomenon goes well beyond the scope of the current study, performance deterioration occurs throughout various critical daily activities, including learning (e.g., prolonged study sessions) and self-regulation (e.g., coping with stress, inhibition, refraining from behaving, or sticking to dietary restrictions). We believe that shedding light on the complex interplay between cognitive limits and individual performance can further our understanding of human behavior in many contexts. Thus, showing initial evidence of online performance deterioration is important and we expect this work to have implications for both computer and cognitive sciences communities.

Results
Next, we present our findings on studying effects of session dynamics on online performance focusing on (i ) empirical observations, as well as by utilizing mixed-effects models on (ii ) performance at session start and (iii ) performance over the course of sessions. We study, after pre-processing, around 40 million Reddit comments posted in April 2015. We sessionize user behavior by periods of commenting activity without breaks longer than 60 minutes as suggested in [18] (cf. Materials and Methods section and Figure 2 for further details). For measuring performance, we look at four proxies of comment quality: text length, readability, the score a comment receives from others, and the number of responses it triggers. For comparison, we also study effects on two randomized session datasets as described in Figure 2. Figure 3 visualizes changes in online performance over the course of user sessions with respect to our quality features (comment text length, number of responses, score, and readability). Different colors and markers distinguish sessions of distinct length (i.e., number of comments written during the session) of up to a length of 5. The x-axis shows the session index of a comment, the y-axis shows the (population-wide) average of respective feature (with error bars). For example, in the first plot of Figure 3a, the red triangle at x = 2 refers to the average text length of all comments written in second position of all sessions of length 3. Figure 3a depicts the original session data of interest and suggests interesting dynamics in user behavior. First, all lines are stacked: The first comment of a longer session also starts out with a longer text, a higher score, more responses, and more complex text (evidenced by higher readability score). Second, all feature values decline throughout the course of a session hinting towards some form of performance deterioration. On average, the last comment of a session is shorter, receives a lower score and fewer responses, and is easier to read.

Empirical Observations
In contrast, these trends largely disappear in our randomized data-i.e., randomized session data shown in Figure 3b and randomized index data shown as in Figure 3c  . The x-axis depicts a comment's index within the session, and the y-axis gives the average feature value (with error bars). The first row (a) depicts the original session data while the second (randomized session data) and third row (randomized index data) visualize results for the randomized data. The results indicate that earlier comments in a session tend to be of higher quality than later ones. Additionally, there appears to be a relation between the session length and the performance of the first comment in a session (stacking of lines). These clear patterns for the original data (a) mostly disappear for both of our randomized datasets (b,c). Overall, these empirical insights suggest performance deterioration over the course of sessions.
by our way of randomizing-see middle and bottom rows of Figure 2. Some sessions (especially in the randomized session data) still stay partly, or sometimes even fully, intact, preserving the original session data. However, the effects are much reduced, for example, the average number of responses in the original data ranges between 0.55 and 0.77, while in the two randomized sets it ranges in the intervals [0.58 − 0.65] and [0.56 − 0.66] respectively. Several considerations limit the conclusions we can draw from these empirical results. First, the population-wide average feature value may not be fully indica-tive of user performance because some distributions (length, score, responses) are heavy-tailed. Second, we have only visualized sessions up to a length of 5. While visualizations including all lengths up to 10 show similar trends (not shown here), more detailed analyses are necessary. Third, and most importantly, we have ignored the fact that our samples are not independent of each other as we repeatedly measure comments for individual redditors. Each user's behavior may be different, for example, one user may tend to write very long comments, while another one may prefer making shorter ones; mixing these different behavioral aspects in one analysis does not allow for specific inference about performance deterioration. We resolve some of these issues by using mixed-effects models incorporating individual differences. We start with an (1 ) analysis of the performance on the first comment in sessions, based on our observation of a potential stacking effect, and continue with (2 ) experiments on performance deterioration over the course of sessions with respect to a potential decline in quality. We only report the most appropriate models and the significant fixed effect coefficients. More extensive experimental results and model analytics can be found online 1 .

Performance at session start
We hypothesized a relation between the length of sessions and their comments' respective quality; readily apparent in the stacking of lines in Figure 3a. We now statistically study this relation by focusing on the simplified question whether the length (number of comments) of a session has an effect on the performance of the very first comment in the session. We model the data with mixed-effects models specified as: feature ∼ 1 + session length + (1|user). The outcome (dependent) variable refers to one of our four quality features. The session length is the main fixed effect of interest. Additionally, we vary the intercept between users (random effect). For this analysis, we limit our data to only consider the very first comment in each session (around 23.5M comments). The detailed model analytics is openly available and can be found online 1 ("start" jupyter notebooks).
The results (fixed session length effects) are summarized in Table 1a. As hypothesized in our empirical population-wide observations, the results indicate that there is a positive relation between the length of sessions (i.e., the number of comments) and their first comment's quality. This is imminent from resulting positive fixed effects coefficients meaning that an increase in session length leads, on average, to an increase of the first comment's text length, the number of responses it triggers, the score it receives and its Flesch-Kincaid grade level which corresponds to higher complexity of written text.
A potential explanation for the observed effect is that users start with different capacities to make quality contributions depending on how many more comments they plan to compose. Another (opposite) explanation could be that a higher performance of the first comment encourages users to produce more comments leading to longer sessions. While we believe the first explanation is more plausible-text length and readability are not based on external success measures, and responses accumulate at a (somewhat) longer time scale-future studies should aim at answering these causal questions. Without resolving the nature of causality, the identified relation between session length and quality of the first comment has implications for the experiments we report below that model the dynamics of user performance during the sessions. We have now shown, empirically and statistically, a high heterogeneity between different sessions with respect to their length. Accounting for this (e.g., as a nuisance effect) in our models is thus necessary. Table 1: Mixed-effects model results. In (a), the models study the effect of session length on the quality of the first comment C 1 in a session; i.e., data only contains the first session comments. In (b), the models investigate the effect of the session index i on the quality of respective comment C i ; data includes all comments in sessions with more than a single comment. Each table highlights the most appropriate models for each quality features based on extensive model analyticslmer refers to linear mixed-effects models while glmer refers to generalized linear mixed-effects models. All coefficients are strongly significant as derived from model comparisons based on BIC statistics.

Performance over the course of sessions
We now turn our attention to the dynamics of user performance in sessions on Reddit. Our empirical insights so far have suggested a performance decline throughout the course of a session. We statistically study this hypothesis by investigating whether the index of a comment (relative position in session) has an effect on the quality of the respective comment. To that end, we apply mixedeffects models specified as: feature ∼ 1 + session index + session length + (1|user). Again, the dependent variable refers to one of our four quality features. The session index is the main fixed effect that we are interested in for studying performance declines. Our models include an additional nuisance effect controlling for individual session lengths as suggested by our previous experiments-model analytics confirm the importance of this factor. An additional random effect models the variations of the intercept between different authors. For this analysis, we consider all data for sessions having more than a single comment (around 24.5M comments). The detailed model analytics can be found online 1 ("dynamic" jupyter notebooks). We summarize the main results in Table 1b and again focus on the fixed session index effect. The results now indicate a negative effect of the session index on our respective quality features indicated by the four negative coefficients. This means that with duration of a session, the quality of comments decreases on average. The next comment in a session is of shorter text length, triggers less responses and a fewer score, as well has a lower Flesch-Kincaid grade level indicating easier complexity of written text. This argues for performance deterioration throughout the course of user sessions on Reddit.
To further confirm observed effect, we repeated the above experiments on the randomized data. For both the randomized session and index datasets, the session index effect is not significant for all features of interest indicating no performance depletion effect in the randomized data. This is in contrast to real session data analyzed above and also confirms that the effects do not simply arise as a result of the order in which comments are made, but their order within a session.

Discussion
Our work presents novel evidence of performance deterioration during prolonged online activity. By analyzing Reddit, a popular online social network that attracts millions of users, we showed that sessions with more activity are significantly associated with production of lower quality content, as measured by the length of the comment posted, its readability score, its average score and the number of responses it receives. In light of these findings, we developed a mixed-effects model that captures online performance deterioration. The code and results for all model analytics are available online 1 .
Our analysis can be expanded in several directions. For example, we have only accounted for the basic differences between distinct Reddit users in the mixed-effects models. Yet, a much more nuanced analysis of heterogeneous effects of online performance deterioration would be warranted. One interesting direction involves understanding whether all individuals exhibit the same levels of performance deterioration, or whether these effects vary from user to user. For example, we might find that all users consistently exhibit deterioration or that different subgroups of users exist, where some users might even show improvements in performance over time. Neuroscience studies found individual differences in working memory and other cognitive activities in the human brain [26]. However, it remains unclear from a physiological standpoint whether capacity to process or produce information varies from person to person [27]. Online performance deterioration may also depend on acquired experience (as a form of cognitive dexterity) with a system. A new, and thus unfamiliar, user in a system may experience faster performance deterioration than an experienced user, because e.g., the cognitive or attention cost associated with the same operations may be experience-dependent (this is particularly true for information discovery and content production activities). A computational study of online performance in this direction could be very valuable.
Additionally, other hypotheses can be studied, such as that performance deterioration depends on the topic (politics vs. funny images), the time of the day, or the intensity of sessions (shorter average time differences between comments). A further aspect to consider is, that we have considered all comments posted to Reddit as equal, meaning that we did not distinguish between those comments posted at the root of a comment hierarchy and those posted further down the hierarchy. Future research in that direction is necessary to better understand observed deterioration effect. For example, top-level comments might generally be of higher quality than low-level comments, or performance deterioration might be stronger for successive posts in the same submission thread compared to comments across submissions. These and similar questions can be studied by our proposed models. They are highly adaptable and fixed and random effects can be utilized to model these potential heterogeneous effects; for example, including a random effect allowing the deteriorating effects to vary between users could already allow us to make further inference about individual differences.
Although our study was confined to Reddit, performance deterioration may generalize to other online activities. Future studies are needed to identify the mechanisms leading to observed deterioration, whether through the loss of attention, mental fatigue, or simply the onset of boredom. Regardless of the causes, understanding the complex interplay between individual's cognitive limits and dynamic behavior is key to optimizing individual-and collectiveperformance in peer production and other online systems.

Materials & Methods
Here, we thoroughly describe utilized data, corresponding pre-processing steps, and statistical mixed-effects modeling approach.

Data
For studying performance deterioration we utilized a publicly available dataset 2 containing all comments (nearly 1.7 billion) ever written on Reddit starting from the first one on October 17, 2007 to the last one at the end of May 2015.
For our experiments, we extracted a smaller sample that limits the data to all comments posted in April 2015. An advantage of this limited data is that we do not need to additionally account for changes in Reddit's platform not only in its interface, but also in its voting mechanisms as well as the general usage patterns of users on the site [28]. Our results are robust to sample data from other months showing similar observations.

Quality features
For measure online performance, we studied the following comment quality features. Text length. This feature counts the number of characters in a comment and is an indicator for its textual length. Each URL in a comment accounts for one additional character. The overall mean of text lengths is µ = 168.08, the median is m = 86.00, and the standard deviation is σ = 281.88. Score. The score is a measure the perception of other users and is the difference between their up-and downvotes (the starting score is 1). All ratings can be summarized by the mean µ = 6.05, the median m = 1.00 and the standard deviation σ = 51.57. Number of responses. We see the number of replies a comment triggers as a proxy for engagement and a comment's success. We only count direct replies in the comment hierarchy. The mean number of responses is µ = 0.61, the median is m = 0.00 and the standard deviation is σ = 1.44. Readability. George Klare provided the original definition of readability [29] as "the ease of understanding or comprehension due to the style of writing". For measuring readability of Reddit comments, we use the so-called Flesch-Kincaid grade level [30] representing the readability of a piece of text by the number of years of education needed to understand the text upon first reading; it contrasts the number of words, sentences and syllables. It is defined as follows: 0.39 total words total sentences + 11.8 total syllables total words − 15.59 The lowest possible grade is −3.4 which e.g., emerges for comments that only contain a sentences having a single syllable such as "OK", only a single URL or only emoticons. We set the maximum Flesch-Kincaid grade to be 22. Simply put, a higher Flesch-Kincaid grade indicates higher readability complexity of a given comment. The overall mean of the Flesch-Kincaid grade is µ = 5.12, the median is m = 4.91 and the standard deviation is σ = 4.61. Correlation of features. Most of the features are not strongly correlated (Pearson's ρ) with each other; however, we can identify two special cases. First, readability and text length have a correlation of ρ = 0.296, which is not surprising given that shorter texts are easier to read, which is accounted for in the Flesch-Kincaid grade level formula. Second, the two success features score and number The log-scaled histogram shows a peak for very short time scales (minutes) and very long ones (1 day) suggesting daily routines. A natural valley emerges between both peaks arguing for the choice of a one hour break between comments for sessionizing.
of responses have a correlation of ρ = 0.558, meaning that comments that get a high score also tend to receive more replies. However, overall, these correlation results indicate that each feature represents interesting aspects on its own.

Sessions
We decided to take the time differences between consecutive comments as session indicators. To that end, we followed the approach advised in [18] where a strong regularity in how social media users initiate events across several different platforms was identified. Authors argue that a good rule-of-thumb is an inactivity threshold of 60 minutes to separate sessions. However, as postulated, we first visually and analytically inspect the log-scaled histogram of time differences between consecutive comments (after cleaning comments, before filtering sessions) as depicted in Figure 4. Similar to the results presented for other platforms [18,31], there is a peak for very short time scales (minutes) and a peak for time differences of one day suggesting daily routine. By fitting a Gaussian Mixture Model (using EM-algorithm, log-normal mixture) with two components to the log-transformed data, we end up with the two means µ 1 = 6.85min. and µ 2 = 794min. A natural valley is visible between the two peaks and thus, combined with the results from the log-normal mixture fitting, we follow the ruleof-thumb of [18] and pick a time difference ∆t i,j of one hour between consecutive comments C i and C j to separate sessions. Note that other (similar) choices of break time (e.g., 30 or 90 minutes) produce similar inference.

Data pre-processing
We took several steps for pre-processing and cleaning the data. First, we removedlist 3 , or (iii ) their account has been deleted; this accounts for around 4.5M comments. Second, we deleted all sessions containing at least one comment (i ) that has been deleted, (ii ) that is completely empty, or (iii ) that contains characters that are not in the ASCII character set (e.g., Chinese characters)-accounting for additional 3M comments. Finally, we removed all sessions containing more than 10 comments accounting for around 7.25M allowing for easier experimental tractability and the removal of further bot accounts. Note though that the inclusion of these sessions into the experiments does not change the main observations of this paper. Our final dataset contains 40, 064, 930 comments produced by 2, 669, 969 different users and posted in 47, 462 different subreddits.

Randomizing sessions
For comparison, we created two randomized datasets to which we applied our analysis. The first baseline-which we call randomized session dataset-attempts to preserve as much information as possible while randomizing the process of sessionizing user commenting behavior. To do so, we shuffled the time differences ∆t i,j between consecutive comments made by each user, but preserved all other features, including the temporal order of comments. Then, we simply sessionized user activity based on shuffled times. An example is provided in Figure 2 (middle row). This baseline dataset is very conservative in terms of randomization and retains many original sessions. For example, many parts of a session stay intact as only the short time differences are potentially swapped, which does not alter the sessions. The second baseline-which we call randomized index dataset-keeps the sessions intact, but randomizes the order of comments inside each session (e.g., exchanging C 1 by C 3 ). Thus, it does not preserve the original order of comments; see Figure 2 (bottom row). Multiple randomization iterations did not alter the results.

Mixed-effects models
For statistically modeling performance deterioration, we utilized mixed-effects models allowing for the incorporation of heterogeneous effects and behavioral differences accounting for the non-independent nature of longitudinal data at hand. Mixed-effects models include both fixed and random effects; following [32], we refer to fixed effects as effects being constant across levels (e.g., individuals) and random effects as those varying between different levels. An overview of mixed-effects models can be found in [33]. In our setting, the introduction of random effects enabled us to consider variations between different levels; the most important level being different users accounting for the inherent differences between individual Reddit users (e.g., the average quality of their comments). As highlighted in [34], mixed-effects models have further advantages, such as flexibility in handling (i ) missing data and (ii ) continuous and categorical responses, as well as (iii ) the capability of modeling heteroskedasticity. For simplicity, let us specify mixed-effects model equations using the following syntax [35]: outcome ∼ 1 + fixed effect + (random effect|level) This specification describes a model where an outcome (dependent variable) is explained by an intercept 1, one or more fixed effect(s), as well as one or more random effects allowing for variations between levels. For all our experiments, we utilize the lme4 R package [35] and fit the models with maximum likelihood. Examples about model specifications can be found online. 4 As each of our experiments is conducted on one of our four different features that all exhibit different properties-e.g., count (text length) vs. continuous (readability) data-we performed extensive model analytics to find the most suitable model for each problem setting. To that end, we focused not only on applying simple linear mixed-effects models, but also on (i ) various transformations, (ii ) generalized mixed-effects models, such as Poisson or negative Binomial regression, (iii ) model diagnostics, (iv ) further refinements that allowed us to compensate problematic data structures leading to e.g., overdispersion or zero-inflation, and (v ) checking for potential problems like multicollinearity, outlier bias, as well as convergence problems. For judging significance of fixed and random effects, we followed an incremental modeling approach starting with the most simple model only explaining the outcome by the intercept and then subsequently adding effects to the model. For comparing the relative fits of these models we used the Bayesian Information Criterion (BIC) [36] which balances the likelihood of a model with its complexity. All reported effects in this paper are significant, except where mentioned (randomized baseline data). For completeness, we also conducted additional significance test for the fixed effects such as t-tests or F-tests confirming our BIC diagnostics.
Instead of reporting the individual analytical steps for each experiment, we focused on reporting fixed effect coefficients, as those are the main effects we are interested in. However, we provide detailed reports for each experiment-based on a sample of 1 million data points-online in the form of jupyter notebooks 1 using R kernels. At the same time, utilized Reddit data is freely available 2 . Making our code and all experiments publicly available allows us to carefully document the results, as well as encourage other researchers to make their own inference and further refine our models.