Caveat emptor, computational social science: Large-scale missing data in a widely-published Reddit corpus

As researchers use computational methods to study complex social behaviors at scale, the validity of this computational social science depends on the integrity of the data. On July 2, 2015, Jason Baumgartner published a dataset advertised to include “every publicly available Reddit comment” which was quickly shared on Bittorrent and the Internet Archive. This data quickly became the basis of many academic papers on topics including machine learning, social behavior, politics, breaking news, and hate speech. We have discovered substantial gaps and limitations in this dataset which may contribute to bias in the findings of that research. In this paper, we document the dataset, substantial missing observations in the dataset, and the risks to research validity from those gaps. In summary, we identify strong risks to research that considers user histories or network analysis, moderate risks to research that compares counts of participation, and lesser risk to machine learning research that avoids making representative claims about behavior and participation on Reddit.


The Baumgartner Reddit Corpus
Trace data sourced from online platforms has become an essential component for many forms of research ranging from sentiment analysis (Pak and Paroubek 2010) to epidemiological modeling (Abdullah and Wu 2011) and economics (Bollen, Mao, and Zeng 2011).Dominant social platforms such as Twitter and Facebook have provided researchers with opportunities to directly study complex phenomena that, at their root, rely strongly on the nature of social interaction (Bond et al. 2012).The reason for this, as Tufekci argues, is that large platforms (specifically Twitter, in this analogy) serve as a model organisms for the social sciences, ones that allow for ideal conditions for measurement of many phenomena in a relatively accessible form.
On July 2, 2015, a new model organism was provided to researchers by Jason Baumgartner -a "complete" copy of one of the largest forums, Reddit, which has gained high visibility in the past several years due to events such as the Reddit blackout (Matias 2016;Newell et al. 2016;Baumgartner 2016) and the Gamergate controversy (Massa-nari 2015a).Subsequently, many researchers have adopted the dataset, and have used it to study a wide range of questions, including the evolution of social networks (Fire and Guestrin 2016), user migration through online platforms (Tan and Lee 2015;Newell et al. 2016), hate speech (Saleem et al. 2016), and online behavior research methodology (Barbosa et al. 2016b), among others.
As a social news platform, Reddit hosts discussions about text posts and web links across hundreds of communities called "subreddits" (Leavitt and Clark 2014; Massanari 2015c) Discussions from public subreddits are aggregated by a variety of news aggregators to create the "front page of the web" that Reddit was founded to provide to its readers (Massanari 2015b).While the site also provides chatrooms and features for live discussions of breaking news (Leavitt and Robinson 2017), the most common reddit experience is centered around top-level submissions and the comnents that people post when discussing those submissions within their subreddit communities.The Baumgartner dataset follows this common experience and includes submissions and comments.
Researchers are drawn to the Baumgartner Reddit dataset for its completeness.In principle, a complete dataset improves research validity by avoiding the ambiguities of samples provided by platform application programming interfaces (APIs) and third-party data resellers (Lotan et al. 2011;Diaz et al. 2016).In this paper, we show that this dataset, as distributed and used by researchers, is not as complete as reported.We report on gaps in this data, categorize the risks to research validity from these gaps, and share collaborative reanalyses of peer-reviewed papers that have used this dataset.Finally, we conclude with reflections on the sensitivity of online behavioral research to the kinds of gaps we found in the Baumgartner Reddit Dataset.

Sequential ID Analysis
The Baumgartner Reddit dataset came about through a convergence of factors: a mostly-public conversation platform, engineering details specific to the design of the Reddit system, and a creative data scientist who capitalized on these characteristics to contribute a unique dataset to public knowledge.
Many databases include the concept of an Identity column, or a column that generates an internal ID to serve as a unique reference to the row, or object, within the database.In many cases, this value auto-increments -the first value in the database assumes a value of 1, the next, a value of 2, and so forth.This number can be artificially shifted within the space -for instance engineers may partition early IDs of 1-1,000,000 for experimenting with data, for some reason, and start all production-system data created by users with ID 1,000,001.Aside from this possibility, if an object contains an ID of n, then it is plausible to assume that there are at least n objects within the database.
In personal correspondence, Baumgartner explained that this intuition led him to develop systems designed to systematically-collect all data on Reddit.Baumgartner's algorithm batches up 100 integers, converts them to the Base 36 representation that Reddit uses to represent their objects, and then queries for those objects.Reddit then returns the request with a set of all public, found objects.Baumgartner's algorithm can be run in a highly parallel environment -many batches of 100 IDs can be concurrently requested, with no need to interact with one another.On other platforms, some error may be returned if data has been deleted.With Reddit, no error is returned -instead, a truncated object reflecting that this deletion has occurred is returned.Therefore, barring technical issues, this method should provide a complete accounting for every ID within the range 1-n for all public comments and submissions within the dataset.Using this method, Baumgartner archived the public record of reddit comments and submissions from the platform's creation through July 2015.Baumgartner has continued to provide this data as a freely-available resource.
In this analysis, we consider The full dataset as released by Baumgartner in July 2015, supplemented with updates published by Baumgartner through the end of February 2016.We also include a followup analysis extended to June 2017.

Diagnosing Missing Data
Because reddit comments and submissions have unique, sequential IDs, we can analyze gaps in the sequence to evaluate the completeness of the dataset.We observed two kinds of missing information: dangling references (known unknowns) and gaps indicated by the absence of information that we would expect to exist given the use of sequential integers to index comments and submissions (unknown unknowns).
We discovered the completeness problem when working with this dataset for our own research.Taking a random sample of subreddits and generating a timeline of daily comments and submissions, plots showed impossible results given the architecture of Reddit: some comment timelines started earlier than their corresponding submission timelines.
The first kind of gap we discovered were dangling references.On Reddit, comments can only occur within a discussion of a submission and can only refer to other comments or submissions.In all cases, a submission would have to exist for a comment to refer to it, a relationship that is unidirectional in time.By traversing these relationships, we observed many references to missing comments and submissions.These can be thought of as "known unknowns:" comments which refer to other comments or to a parent submission, where the referred-to comment or parent submission is not contained within the Baumgartner dataset.
We also observed a second kind of gap: objects that are never referenced in the dataset but are likely missing.If all comments and submissions are given sequential integer IDs, we would expect an unbroken sequence of integers to be associated with information in the dataset.This is not the case.Consider the comments dataset: the earliest comment in the Baumgartner dataset is comment #2 and the highest is #29,484,960,643.In October 2007, the Reddit Company incremented the comment IDs by several billion IDs.When accounting for this difference, we assume that any other gaps in the sequence of comment IDs can be attributed to gaps in the dataset: we count 943,755 total potentially-missing comments up to February 2016.
Missing comment IDs could be attributed to many possible causes.These IDs could be dangling references, public information that for some reason were not returned by Reddit's systems to Baumgartner's software at that moment, or information that was part of a community that had set its discussions to be private.These missing IDs are not associated with deleted content, since the Reddit platform returns information about deleted data, which is included in the Baumgartner dataset.
For submissions, we are less confident about the magnitude of missing unknown unknowns.While we have observed 1,539583 "gaps" in the space of IDs for submissions through February 2016, the first submission in the Baumgartner dataset starts at 9,970,002.When searching for submissions between #1 and #9,970,001, we have successfully found some submissions, leading us to believe that millions of submissions from the early history of Reddit may be absent from this dataset.Deleted content, which is included in this dataset, represents a risk to validity that we do not consider here.A user who deletes even one comment in their posting history introduces many of the problems we describe in this paper, even if the fact of the comment is recorded in the Baumgartner dataset.

The Per-User Risk of Missing Data
How likely is a researcher to encounter these gaps?To address this question, we estimate the per-user risk of missing data, using a random sample of 7,400 accounts from the Baumgartner dataset.
The average user in this sample commented 6.8 times and commented 96.6 times from late January 2006 through February 2016.These averages occur on a highly skewed distribution, as illustrated by the log-histograms in 1.Based on table 1, the known maximum amount of missing comments and submissions is 943,755 and 1,539,583, respectively -dangling references are a subset of "unknown unknowns."Across the entire Baumgartner dataset, only 0.043% and 0.65% of comments and submissions, respectively, are missing.The issue has a compounding effect, since a small number of users create a large amount of the content on the platform.The more posts and comments someone produces, all else being equal, the more likely their histories will be affected by the missing data issue.As we have also showed, unknown unknowns expanded dramatically in the 16 months following February 2016 and now include 36 million missing comments and 28 million missing submissions.
What is the probability of data loss for an individual Redditor history?While in reality the missing data is not uniformally distributed throughout the corpus, we can estimate the effect by compounding probabilities to assess the degree to which a user could be affected by only a small amount of missing data.Using the averages from earlier, we can calculate the risk of any individual submission r s or comment r c being missing simply by n c r c and n s r s , respectively.In this case, the "average" Redditor may be exposed to a total maximum risk level of ∝ 4.18% likelihood for missing at least one comment and ∝ 4.46% for missing at least one submission.In the 7,400 individual set, approximately 2% of the sampled users had a 50% or greater chance of having a missing comment, and 2.6% of the sampled users had a 50% or greater chance of having a missing submission.These estimates were based on the census of dangling references and unknown unknowns from the beginning of the corpus to February 2016; we expect relatively similar rates in later data, since the rate of missed content has been consistent for the past several years.We offer these rough approximations to communicate a qualitative sense of how this missing data issue may create an appreciable problem for some forms of research.We include a more detailed typology of possible errors below.
Leaning on Jo et al., we employ a measure of "burstiness" which considers the relative dispersion of errors throughout the ID space per each month of gathered data.This measure is bounded from [-1,1], where a score of -1 indicates completely evenly dispersed errors, and a score approaching 1 indicates that errors are located in a more concentrated set of missing blocks.Figure 2 shows many high positive burstiness scores, indicating that missing blocks are often distributed unevenly within months throughout the dataset.

Distribution of Gaps Across Communities
We also considered the degree to which missing content differentially affects individual subreddits.If data from some communities were more affected by gaps than others, the gaps could influence the results of comparative research about populations communities (Hill and Shaw 2017).If gaps affected communities equally, we would expect that the number of missing pieces of content monotonically rises with the number of overall pieces of content posted to a subreddit.As figure 5 shows, we find only marginal evidence for such a supposition.While more missing content is positively and significantly associated with larger subreddits, we do not find a direct relationship.One confounding factor may be the temporal "center of gravity" of a subreddit -older subreddits are positioned at a time when more content was missing, on average, which may differentially affect older subreddits.Table 2: Regression exploring the relationship between amount of missing content per subreddit and total amount of known content per subreddit, and month in which the subreddit was created.We expect that these two variables would have meaningful explanatory power for where missing content is -we find that this appears to be the case for missing comments but not for missing submissions, as evidenced by the relative R 2 values.
We attempted to control for subreddit age in a multiple linear regression which accounted for the size of subreddits as well as the time at which those subreddits were created; we did not find any meaningful increase in explanatory power in the adjusted model.Table 2 .The time at which a subreddit was created, however, is a poor proxy for the true "center of gravity" of content (i.e. the time at which a subreddit was most active), a characteristic that these models do not account for.
In the above sections, we have considered the influence of potentially-missing content on analyses of users, behavior over time, and groups.We observed numerous sources of potential bias in research: a substantial percentage of users could be affected by these gaps, the gaps are not evenly distributed across time, and gaps are not evenly distributed across communities.

How Missing Data Affects Common Research
Methods in Computational Social Science How might these gaps influence research in practice?We expect that researchers asking different kinds of questions will face different kinds of risks from missing data.In the following sections, we categorize published literature that uses this dataset and offer a typology of the risks that these gaps represent to common research methods in computational social science.
User history analysis papers face the highest risks from missing data, since a missing comment or submission could hide an important part of that user's history.A network analysis may fail to include a user's participation in a particular community or interaction with a key user.Furthermore, survival analyses might mis-estimate the moment of a person's departure or their participation level.Network analysis papers also face high risks, since the presence or absence of a tie could be dependent on the missing data.Sum analyses that count the size or incidence rate of participation in sub-reddits or the use of certain kinds of language face moderate risk, especially when analyzing small communities and rare events.Content analysis that involves training machine learning systems on Reddit comments face minimal risk because their research rarely includes claims about the population of Reddit users.

Risk to User History Analyses
Papers that test hypotheses based on user histories on Reddit may have substantial gaps in the histories that they seek to test.Analyses on user histories that consider the history in full are, in general, exposed to the highest risk -analyses that are especially sensitive to high-volume users are very likely, on average, to consider users whose histories have gaps.Hessel, Tan, and Lee, for example, observes and compares sums of comment participation between subreddits, and observes the full chain of user history -Hessel et al. adopts a similar approach.Barbosa et al. compares year cohorts of individual-level behavior across all of Reddit, and as has been shown, some years are more affected by gaps than others.Additionally, the large number of potential missing submissions from Reddit's earliest years may also affect these findings.If a user history analysis requires the complete posting history between subreddits for a given user, gaps in such transmissions may constitute meaningful gaps in explaining a wide array of hypotheses.

Risks to Network Analyses
Some papers test network hypotheses by constructing interaction networks between users or communities, sometimes over time.Data gaps also represent a high risk to these papers, since missing submissions may result in unobserved ties in the network.Tan and Lee observes histories of user accounts participating in different communities, while Fire and Guestrin observe network ties over time modeled on user histories.Substantial blocks of missing data, including the potentially large amount of missing submissions from Reddit's nascency could redraw the map of community ties on the platform.Tree structures reconstructing threads are also similarly affected, such as work by Hessel, Lee, and Mimno and Fire and Guestrin, which through linkages of comments and submissions similarly face issues due to missing submissions (i.e.parents of threads) or comments.

Risks to Research That Counts and Compares Participation Between Communities
Other papers test hypotheses based on participation sums within communities.Gaps that are biased toward particular communities will represent a risk to the validity of these studies.Matias observes levels of subreddit participation by moderators, observes relative participation levels of subreddit commenters in other subreddits, and observes moderator participation in "metareddits".Newell et al. observes comment volumes within subreddits.Barthel observes comments about political candidates across Reddit during a period where many submissions are within the dataset.Barbaresi analyzes German language text to identify relative commenting rates about places in Germany.
Horne and Adali considers posts within /r/worldnews to determine linguistic characteristics of why some news frames are more visible than others.Dosono, Semaan, and Hemsley considers a specific set of communities associated with self-expression of Asian-American Pacific Islander (AAPI) identity on the platform.
As we showed in figure 5, gaps do not appear to be evenly distributed across communities, since the number of missing comments and submissions per community is not strongly correlated to the number of observed comments and submissions in that community.While a simple statistical regression between the total counts of missing data and known data shows the relationship to be significant, the R 2 is low enough in both cases to lead us to conclude that studies on some subreddits could lead towards very biased results due to higher than random amounts of missing data.
In practice, we observe 78 subreddits where at least 20% of the comments are missing, and 1,755 subreddits where at least 20% of the submissions are missing.Among subreddits that have any dangling references, on average they are missing at least 35% of their submissions.The R 2 score in a model predicting the volume of a community's missing observations from the volume of observed comments and submissions only explains 30% of the variance of missing comments and 10% of the variance of missing submissions (figure 5).The risk to any specific study will depend on the distribution of gaps across the specific communities being compared.

Risks to Machine Learning Models
Finally, some studies train machine learning models and conduct linguistic analysis of the Baumgartner dataset.Insofar as these studies do not make claims about populations, gaps represent a minimal risk to the validity of this research.For example, Saleem et al. trains machine learning models on comments from particular subreddits that have since been quarantined or banned by Reddit for harmful behavior.
In our observations of communities where the mass of missing data is pooled, it seems to trend towards such communities -across the three subreddits considered in their work, one of those subreddits has a large number of dangling references: observed comments refer to 696,642 unique missing submissions in the dataset for this one community alone.Among comments, 1,100 of 1,585,014 total comments were known to be missing.Saleem, Dillon, Benesch, and Ruths have re-analyzed their data after filling some gaps and fail to find any substantial differences in the performance of their machine learning models (citation forthcoming).Furthermore, since the purpose of this kind of machine learning research is to make inferences about outof-sample observations rather than to test hypotheses about a population, such research may be less sensitive to variation due to missing data.

Discussion
All datasets have biases, no matter how complete we wish them to be.In the process of designing research, conscientious researchers will study those biases, document them, and account for them as best as possible.In this paper, we have shown ways in which an influential public dataset does not represent the "complete" record that its publisher and users aspired to.We have documented per-user risks of missing data, risks from the uneven distribution of missing data over time, and risks in the uneven distribution of missing data across communities.We have outlined the risks to research validity represented by these data gaps, including some of our own work.
We have raised these issues in direct conversation with Baumgartner, who has quickly and graciously re-processed ID blocks with missing data and filled in any gaps that are able to be filled.By publication time of this paper, we believe that any missing data that can be filled will have been done so for datasets provided directly by Baumgartner up to February 2016.Data shared from any other source may still include these missing observations.Since any missing data that Reddit does not provide will still be missing from the corrected datasets, we encourage researchers to check the integrity of your data when publishing results from this dataset.
More widely, the case of this so-called complete dataset draws attention to the risks to validity from research cultures that move fast to produce new results when new data is released.While many researchers have utilized Baumgartner's generous work on this Reddit dataset to investigate important questions, too few of us questioned a "completeness" statement that shouldn't have been accepted as truth.This dataset has numerous omissions, and those issues affect different research agendas with varying levels of severity.
As researchers, we need to protect ourselves from the dazzling scale of large datasets.We encourage more people in Baumgartner's position to collect data, share it in an ethical manner, and contribute to knowledge through the research that it enables.It will not always be possible or reasonable to place strict methodological expectations upon such citizen scientists -that responsibility lies firmly on academics.We hope this paper will encourage other researchers to test their assumptions and document data quality when conducting social scientific research with large datasets that they did not collect.

Figure 1 :
Figure 1: Histograms of sampled user Submission and Comment counts

Figure 2 :
Figure 2: Burstiness of missing submissions and comments per month, 2005-June 2017

Figure 4 :
Figure 4: Varied measures of missing comments per month.Medium blue circles denote the percent of comments missing for each month of data, bright blue squares denote the rolling average percent of missing comments, and dark blue stars denote the cumulative total number of missing comments to date.

Figure 5 :
Figure 5: Gaps are not evenly distributed across communities.The total historical counts of comments per community comments are mildly correlated with the number of dangling references, while submissions are not very correlated with the number of dangling references.

Table 1 :
Totals for missing data in the Baumgartner dataset