Enhancing big data in the social sciences with crowdsourcing: Data augmentation practices, techniques, and opportunities

Proponents of big data claim it will fuel a social research revolution, but skeptics challenge its reliability and decontextualization. The largest subset of big data is not designed for social research. Data augmentation (systematic assessment of measurement against known quantities and expansion of extant data with new information) is an important tool to maximize such data's validity and research value. Trained research assistants and specialized algorithms are common approaches to augmentation but may not scale to big data or appease skeptics. We consider a third alternative: data augmentation with online crowdsourcing. Three empirical cases illustrate strengths and limitations of crowdsourcing, using Amazon Mechanical Turk to verify automated coding, link online databases, and gather data on online resources. Using these, we develop best practice guidelines and a reporting template to enhance reproducibility. Carefully designed, correctly applied, and rigorously documented crowdsourcing helps address concerns about big data's usefulness for social research.


INTRODUCTION
Big data and computational approaches present a potential paradigm shift in the social sciences, particularly since they allow for measuring human behaviors that cannot be observed with survey research (Lazer et al. 2009; Moran et al. 2014). In fact, the transformative potential of big data for the social sciences has been compared to how "the invention of the telescope revolutionized the study of the heavens" (Watts 2012:266). However, social scientists have been slow to embrace big data. One reason is "the need for advanced technical training to collect, store, manipulate, analyze, and validate massive quantities of semistructured data" (Golder and Macy 2014:144), training that remains nascent in many fields. But there are deeper, more fundamental constraints on the acceptance of big data among social scientists.
Despite its promise, big data's perceived limitations cast uncertainty on its applicability in the social sciences. Computer, information, and physical scientists have rapidly embraced big data because the information it makes available is unprecedented in those fields. Typical taxonomic efforts from computer scientists and others to delineate big data from traditional forms of data focus on these novel characteristics in what is called the "three Vs" framework (Hitzler and Janowicz 2013; Yin and Kaynak 2015): volume (or amount of data), velocity (or speed of data release), and variety (or data on rarely recorded activities). Volume, velocity, and variety are what make big data compelling and useful in a diverse array of fields. However, social scientists are concerned with two other Vs: validity¹ and value (Hitzler and Janowicz 2013; Monroe 2013). These additional Vs, which indicate authenticity or truth (validity) and what we can do with and get out of the data (value), are often lacking in big data research (Monroe 2013; Yin and Kaynak 2015). Characteristic of social science skepticism around big data are concerns that "[t]he reliability, statistical validity and generalizability of new forms of data are not well understood. This means that the validity of research based on such data may be open to question" (Entwisle and Elias 2013:1). Put bluntly, big data do not come from a heavily theorized and well planned scientific research project, which, at a minimum, creates discomfort among social scientists (Lazer and Radford 2017).
¹ Other computational and information scientists refer to the 5 Vs of big data as including volume, velocity, variety, value, and veracity. Political scientists swap value and veracity for 'vinculation' (to bind together in a relationship) and validity, quipping: "[t]here are as many 'fourth Vs of big data' as there are 'fifth Beatles'" (Monroe 2013:1). We stray slightly from this jargon and refer to veracity as validity in the rest of this paper to more closely match the language of social science methods.
Without clear approaches to quantify and increase the validity and value of big data, we believe social science skepticism of big data will remain high. Researchers need to be convinced of the validity and value of big data without adding to the cost of using it, both of which we suggest can be accomplished through data augmentation. We define data augmentation as the process of (a) systematic assessment of measurement against known quantities or (b) expansion of extant data by adding new information.
Data augmentation is a standard technique throughout the social sciences that can assume a manual or automated approach. Traditionally, these tasks are accomplished using trained research assistants (manual) or specialized algorithms (automated) to detect erroneously coded data (validity) or append existing data sources with new material (value). An example of a manually augmented big data project is a study of posts made by high-schoolers on the Twitter social media platform that mention bullying. In this study, the authors used two human coders to classify whether each post that mentioned bullying (or bullied, or bully, etc.) was an actual report of adolescent bullying or whether it represented some other use of the relevant terms (Bellmore et al. 2013). In this case, the authors used data augmentation to increase validity. An example of automated data augmentation used to increase value is a well-known experiment on the social media platform Facebook (Kramer, Guillory, and Hancock 2014). In this experiment, the authors examined how respondents' purported emotions changed after being shown more purportedly positive or negative posts from friends, where emotions and their associated positivity or negativity were assessed by applying a sentiment analysis method to the words used in posts. Sentiment analysis, in this case, serves as an automated way to gain additional information about big data (the posts), augmenting its value for research purposes. Of course, there are many more examples of both manual and automated approaches to data augmentation to add either validity or value or both (e.g., Maldonado et al. 2015; Bail 2016).
Unfortunately, data augmentation can be challenging to implement at the scale required for big data projects in a way that addresses social science skepticism. The manual data augmentation in the aforementioned study of bullying, for instance, was only feasible because the researchers examined a manageable number of messages (N=7,321). Automated data augmentation approaches, such as sentiment analysis, are also difficult to implement without advanced training and may themselves be of questionable validity. For instance, the automated augmentation used in the Facebook experiment discussed above has been criticized by social scientists for being of unknown, and potentially low, validity (Panger 2016). Of course, the validity of automated data augmentation approaches can be assessed and potentially improved through manual data augmentation, as is becoming more commonplace in big data projects through procedures such as supervised machine learning (Bail 2014), but the size and complexity of most big data would require a great deal of time and expense for knowledgeable trained coders (such as graduate assistants) to check.
In this paper, we argue that online crowdsourcing platforms can complement both manual and automated approaches to data augmentation, increasing the validity and value of big data in the social sciences at a low cost to researchers. We show that such tools are underused for nonexperimental designs in the social sciences and that workers on these platforms can rapidly and inexpensively verify automated coding, find errors in embedded metadata, and resolve missing data in many cases. We build this case in five steps: (1) review the use and perceived limitations of big data in the social sciences, (2) describe the online crowdsourcing process and its documented strengths and limitations as a platform for academic research, (3) investigate current practices in academic use of the largest online crowdsourcing platform, (4) conduct three case studies implementing online crowdsourcing to enhance ongoing sociological research and test the utility of crowdsourcing across different circumstances, and (5) draw on all of the above, as well as experiments embedded within the case studies, to produce evidence-based recommendations on when and how to implement online crowdsourcing to augment big data for best results. Finally, in light of the inconsistent and frequently incomplete reporting of online crowdsourcing procedures, we provide a recommended reporting template for online crowdsourcing as an academic data augmentation platform. We believe that this paper offers a clear roadmap for social scientists to begin incorporating more big data into their research designs, and we conclude by reflecting on the strengths and limits of online crowdsourcing approaches to data augmentation for these purposes.

Big Data Skepticism in the Social Sciences
Myriad actors such as corporations, governments, scientists, and even sports teams have embraced big data (Lohr 2012; Mayer-Schönberger and Cukier 2013; Murdoch and Detsky 2013), but adoption has been slow thus far in the social sciences (Lazer and Radford 2017). To understand how social science adoption of big data compares to its use in other fields, we searched Thomson Reuters' Web of Science database in April 2017 for academic articles with the phrase "big data" (with quotes, not case-sensitive) appearing in the title, abstract, or keywords. The phrase gained its contemporary meaning in 2004; for a number of years thereafter, only a handful of isolated articles drew on the idea. Figure 1 shows the time series of papers about "big data" from 2009-2016, both overall and by some key fields. Beginning around 2011, overall use of big data began to increase exponentially. The increase has not been even across fields, however, as growth has been concentrated in computer science and other computationally intensive fields like engineering. By contrast, social science use remains minimal, with, for example, only 70 publications categorized as sociology between 2004 and 2016 (1.37% of all publications listing big data).²
² Thomson Reuters' classification scheme for research areas may not correctly identify sociology articles, or sociologists may be publishing big data articles in non-sociology journals. We do not feel that these possibilities restrict our general conclusions, because, in either case, sociologists will experience less exposure to big data articles. A related concern is that other fields simply produce more research than the social sciences, thereby accounting for the small role of the social sciences in big data research. However, research into article counts by discipline does not indicate the levels of disparity seen in Figure 1. For instance, Jaffe (2014) shows that the social sciences and psychology produced approximately 150,000 articles in 2011, whereas engineering produced approximately 250,000.
The literature indicates that the primary reason social scientists are making relatively rare contributions to big data research is that these fields hold deep skepticism about big data deriving from the fact that it is not designed for academic research (Lazer and Radford 2017). Even those optimistic about the promise of big data critique its validity and value, including its lack of standardized reporting (K. Lewis 2015), poor measurement (Diesner 2015), decontextualization (Bail 2014), and tendency toward "big data hubris" (Lazer et al. 2014) that ignores threats to validity (Adams and Brückner 2015; Park and Macy 2015). Generalizability is another concern; most big data studies do not proceed with a clearly conceptualized population to which inference [...]
In general, the primary means of assessing and increasing the validity and value of data in the social sciences is undertaken through what we refer to as data augmentation. As reviewed above, there are both manual and automated approaches to data augmentation, but neither is likely to be sufficient to rise to the scale of the problems posed by big data and address social science skepticism about it. Instead, we focus on a third option that can enhance both automated and manual approaches to data augmentation: using online crowdsourcing marketplaces such as Amazon Mechanical Turk (MTurk). Online crowdsourcing is less technically demanding than automated approaches and can provide supplemental evidence of accuracy based on user judgment or augmented comparison with outside sources or both. Compared to common manual approaches, MTurk is nimbler and less costly, allowing increased scale of augmented analysis.
Compared to purely automated approaches or even blended approaches like supervised machine learning, online crowdsourcing through MTurk has the ability to produce well-understood measures of validity like inter-rater reliability or to merge data with sources that are not amenable to automated discovery, as well as retaining the reassuring feature that actual people have examined the coding. While some social scientists are using MTurk for research (Flores 2016; Gaddis 2017), we argue that formalizing this approach to data augmentation will expedite the widespread acceptance of big data in the social sciences and overcome barriers to its application. In the next section, we review MTurk as a promising research platform that we argue allows researchers to undertake big data augmentation at scale more simply, quickly, and cheaply than data augmentation through traditional automated or manual approaches.

MTurk as a Research Platform
The name "Mechanical Turk" is derived from the 18th-century chess-playing "machine." The original Mechanical Turk consisted of a complex cabinet of gears with a magnetic chessboard on top and a model of a human similar to a mannequin dressed in Turkish robes with a turban. Human chess players could play against the "machine" and would often lose. The Mechanical Turk toured Europe and the United States throughout the late 18th and early 19th centuries. However, the Mechanical Turk was a hoax, as it was not an automated machine but rather an elaborate fake with a man inside playing the actual chess game (Levitt 2006; Standage 2004).
Thus, Amazon named their own version after the original Mechanical Turk to indicate that humans can still do things that computers cannot. Amazon's MTurk is an online crowdsourcing marketplace that brokers what MTurk parlance refers to as Human Intelligence Tasks (HITs) between requesters and workers.³ The idea of a HIT is described succinctly by Amazon:
"Amazon Mechanical Turk is based on the idea that there are still many things that human beings can do much more effectively than computers, such as identifying objects in a photo or video, performing data de-duplication, transcribing audio recordings, or researching data details. Traditionally, tasks like this have been accomplished by hiring a large temporary workforce (which is time consuming, expensive, and difficult to scale) or have gone undone."⁴
Anyone eligible for employment in the U.S. or India can work on MTurk, although task completion requires reliable internet access. U.S.-based MTurk workers are typically younger, more educated, wealthier, more technologically savvy, and less racially diverse than average Americans (Berinsky, Huber, and Lenz 2012; Krupnikov and Levine 2014; Paolacci and Chandler 2014). As such, many worry that samples drawn from MTurk are less representative than population-based surveys (Berinsky, Huber, and Lenz 2012), though not as fraught as convenience samples (Buhrmester, Kwang, and Gosling 2011).
However, when considering MTurk as a big data augmentation platform, as we propose, rather than a population to sample and survey, we argue that work quality matters more than worker representativeness. MTurk workers tend to pass screening tests at high rates (Berinsky, Huber, and Lenz 2012) with high reliability between (Behrend et al. 2011) and within workers (A.R. Lewis et al. 2015). Recruiting workers for data augmentation tasks through MTurk has three major limitations. First, workers lack specialized area knowledge; second, they cannot access restricted information (e.g., workers cannot download most academic journal articles); and third, MTurk compensation is based on task completion, not time, which presents challenges for fielding complex, judgment-based tasks (Goodman, Cryder, and Cheema 2013; Krupnikov and Levine 2014). We return to these ideas below. For now, it is worth noting that these limitations mean that crowdsourced tasks are most appropriate for data augmentation when they can be broken into concise and unambiguous chunks using non-confidential information.

MTurk in the Academy
MTurk is popular with academic researchers; a recent Pew Research Center report (Hitlin 2016) found that academics posted the plurality (36%) of all HIT groups during one week. Academics have hailed MTurk's low costs and rapid results, and even expressed cautious optimism about it as a survey platform (Horton, Rand, and Zeckhauser 2011; Weinberg, Freese, and McElhattan 2014). Its feasibility and reliability for big data augmentation, however, remain unexplored.
To better understand how academics use MTurk, especially for data augmentation, as well as how they report on such use, we conducted a content analysis of a random with-replacement sample of 100 articles from Web of Science matching the topic search "mechanical turk" and published between 2011 and May 2016. The search, performed May 23, 2016, returned 767 total records. We removed eight false matches, one poster, and three papers we could not find, yielding a final sample size of 88 articles (80 unique; statistics below are weighted for replacement sampling). In the online supplement, we provide metadata about these articles. We [...]
Over half (61%) of the papers we examined were in psychology and related fields (psychiatry, social psychology, and cognitive science), followed by business and organizational fields (10%), computer science and engineering (9%), and (non-mental) health fields (6% [...]
The results of our content analysis highlight that academic use of MTurk remains concentrated in psychological fields, and for experimental studies, piloting, and surveys. In contrast to this typical use, we advocate that researchers expand their use of MTurk for augmenting big data studies to address concerns about validity and value. We found that researchers are beginning to do this, but they do not offer enough detail on the process for it to be formally evaluated. To this end, the remainder of this article examines three case studies and focuses on developing clear, evidence-based guidelines for best practices on when and how researchers can augment data with MTurk and report on doing so.

Case Studies
We now present three case studies that apply MTurk to diverse sociological subfields to augment big data (cases 1 and 2) or test MTurk's data augmentation capacities against known benchmarks from ongoing sociological data collection (case 3). These cases allow us to compare MTurk to other data augmentation approaches, both automated and manual. For cases 1 and 3, we collected analogous data automatically and manually, enabling validity comparisons. We also embedded design experiments in cases 2 and 3 to test how HIT design and implementation can affect cost, quality, and worker experience. Our goal is to develop intuition for the benefits of big data augmentation through online crowdsourcing and how researchers can best move forward with such projects.
We designed all HITs based on past recommendations (Buhrmester, Kwang, and [...]
Rather than training internal coders to verify these results, we tested the data augmentation capabilities of MTurk. We did so by creating three sequential tasks that split the process of validating the algorithmic coding of faculty members' fields into discrete steps. First, we asked workers to find the departmental webpages of a random sample of faculty members using a search link that limited results to the official website of their academic institution (see discussion and appendix for details). This step provided a sample of faculty whose academic field could be externally validated. Second, we asked workers to verify links obtained in task 1 and indicate whether each faculty member was listed in any of the 10 most common department names in the algorithmically coded field. This step helped to ensure that the links for specific faculty were correct. Finally, in the third task, we asked workers to evaluate whether any field on the faculty member's page is associated with the field that was algorithmically assigned. For instance, if a faculty member listed "speech and pathology" as their field and the assigned field is "speech and hearing sciences," we want workers to indicate that these fields are associated. This step constituted our primary interest, quantifying the validity of the algorithmic coding. We adapted all tasks from MTurk templates using the HTML and JavaScript programming languages, and collected responses from separate but potentially overlapping pools of workers within the MTurk interface. A graduate research assistant invested approximately 40 hours in learning and managing this MTurk data collection. In all, we used MTurk data augmentation to check 2,043 automated classifications of faculty member fields, at a total cost of $590 including fees and pilot costs.

Study 1: Academic Affiliation -Results and Discussion
Were MTurk workers, operating without substantial oversight or prior training, able to validate the results assigned by algorithm? This case speaks to MTurk's ability to add validity to big data, used here to confirm the automated coding of a large data set and bound rates of coding error. Table 2 summarizes the combined results for Case 1. Workers in the initial HIT successfully located 85% of faculty, mostly on preferred page types (faculty homepage, administrative list, or curriculum vitae). Subsequent workers flagged only 3% of URLs that prior workers submitted as referring to the incorrect person or institution. Of cases with unflagged URLs, workers identified 94% of faculty members as matching either the field or department we provided, which suggests that the original automated coding of these big data succeeded at a high rate, even allowing for the possibility of substantial worker error. Mean hourly worker pay in this case ranged from $7 to $16 and was higher for workers completing multiple HITs.⁵
This case revealed some important lessons. Early pilots combined all stages (page location, department classification, and field classification) into a single HIT, but we found that workers took longer and gave flagged results more often in such conditions. With later pilots, we found that dividing tasks into the three steps outlined above minimized worker time and let us build in cross-verification tests where subsequent workers verified both the faculty web pages and affiliations provided by earlier workers.

Study 2: Linking to OpenLibrary -Overview and Methods
Our second case highlights how data augmentation with MTurk can enhance the value of big data. Here, we asked workers to connect data sources (adding value to big data), and we experimentally tested how HIT design may affect work quality. This case builds on a project investigating book co-purchasing patterns connecting cultural groups, operationalized with retailer metadata scraped from the web. Unfortunately, necessary metadata were often incomplete, missing, or of questionable quality. For example, a book written by the founder of one Protestant denomination (Martin Luther) was listed as the top-selling item associated with a completely different denomination. To supplement missing information, we matched 1,055 (58%) books to additional metadata provided by OpenLibrary.org using international standard book numbers (ISBNs), a unique code identifying books. For remaining unmatched books, we tested MTurk's data augmentation capacities by asking workers to search for the books on OpenLibrary. As an experiment to determine means of improving HIT design, we randomly assigned each worker into one of three task variants. The first variant included full instructions with design features to enhance clarity (e.g., highlighting key text); the second used brief instructions but retained design features; while the third included full instructions with minimal formatting. Figures 2-4 provide screen shots of each condition; note that Amazon uses the ${variable name} notation as code to substitute values from input data provided by the requester (code available in supplemental files).
Case 2 workers successfully found 283 potential matches (37%) for missing books in the original data. We followed up on HITs with comments and rejected submitted URLs outside the specified page types. A researcher checked every 20th HIT returned for accuracy during data collection and found very low rates of false matches (<1%) and false negatives (5%-10%). Checking during data collection (rather than using a simple random sample of all returned HITs) provides opportunity to save money by canceling remaining unclaimed HITs if design flaws are discovered. Consistent with case 1, the 33 workers who completed only one task in this case averaged 298 seconds, but the 50 workers who completed multiple tasks averaged only 126 seconds per task. Total cost for this case including fees was $235.
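The spot-checking procedure described above (auditing every 20th returned HIT while collection is still running) can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the verdict labels, the `should_pause` helper, and the 15% error threshold are all hypothetical choices made for the example.

```python
from collections import Counter

LABELS = ("ok", "false_match", "false_negative")  # hypothetical verdict labels


def spot_check_positions(n_returned, interval=20):
    """Zero-based positions of every `interval`-th HIT, in return order."""
    return list(range(interval - 1, n_returned, interval))


def audit_summary(verdicts):
    """Share of each researcher verdict among the audited HITs."""
    if not verdicts:
        return {label: 0.0 for label in LABELS}
    counts = Counter(verdicts)
    return {label: counts[label] / len(verdicts) for label in LABELS}


def should_pause(summary, max_error=0.15):
    """Flag the batch for review when audited errors exceed `max_error`.

    The threshold is an assumption for illustration, not a value from the paper.
    """
    return summary["false_match"] + summary["false_negative"] > max_error


# Made-up audit of the first 100 returned HITs.
positions = spot_check_positions(100)
verdicts = ["ok", "ok", "false_negative", "ok", "ok"]
summary = audit_summary(verdicts)
print(positions)
print(summary, should_pause(summary))
```

Auditing in return order, rather than sampling after the fact, is what makes early cancellation possible: a requester can inspect `summary` after each audited HIT and stop releasing work as soon as a design flaw shows up.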
The experiment we embedded in this case illuminates how HIT design affects cost and quality. Workers presented with detailed instructions and design features spent less time per completed HIT (mean 171 seconds, S.D. 145) than those provided concise (230, S.D. 317) or minimally formatted (245, S.D. 233) instructions. However, because of the small cell sizes in this task, such differences are not significant with two-tailed t-tests; nonetheless, we take the magnitude of the differences to indicate that better instructions are likely to yield better results. Though there is a general concern that paying workers per task may lead them to rush and skim longer instructions, yielding lower quality work, we did not find that this approach compromised accuracy in our testing. Instead, work accuracy in all three groups was high and statistically indistinguishable. We speculate that fuller instructions may reduce cognitive demands on workers and thus lead to lower completion times with comparable accuracy.
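To see why a difference of roughly a minute in mean completion time can fail to reach significance, it helps to compute a t statistic directly from the reported means and standard deviations. The sketch below uses Welch's unequal-variance form, which is an assumption (the paper does not say which t-test variant was used), and the cell size of 25 per condition is purely hypothetical because cell sizes are not reported here.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1 = sd1 ** 2 / n1  # squared standard error of group 1's mean
    v2 = sd2 ** 2 / n2  # squared standard error of group 2's mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


N_PER_CELL = 25  # hypothetical: the paper does not report cell sizes

# Concise instructions (mean 230, S.D. 317) vs. detailed instructions
# with design features (mean 171, S.D. 145), as reported in the text.
t, df = welch_t(230, 317, N_PER_CELL, 171, 145, N_PER_CELL)
print(f"t = {t:.2f}, df = {df:.1f}")
```

Under these assumed cell sizes the statistic falls well below 1.96, and two-tailed critical values of the t distribution at the 0.05 level are never smaller than that, so the large standard deviations swamp the difference in means.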

Study 3: Mental Health Websites -Overview and Methods
Our third case study does not focus on a big data project directly. Instead, it tests the possible extent of MTurk's data augmentation capacities and directly evaluates MTurk data augmentation against a "gold standard" benchmark from a set of trained coders in an existing sociological data set. This case reveals how task complexity affects MTurk results, and it provides alternate methods of assessing the quality of MTurk data augmentation. In this case, we compare the performance of trained coders against MTurk workers in a study of college student mental health. The Healthy Minds Study Institutional Website Supplement (HMS-IWS) collects data on 74 topics across 8 areas related to resources, information, and the presentation of information on mental health services from college and university websites. It is, itself, adding value to a standard survey (the Healthy Minds Study) through manual data augmentation.
For three years, the HMS-IWS team, including a Ph.D. researcher and two trained graduate research assistants, have each coded relevant items from institutional websites. There is high inter-rater reliability in this manual data augmentation approach but also extensive costs and time. In this case study, we asked 40 MTurk workers to record information from one of three college or university websites. We provided workers with a brief explanation for each task (see Appendix) as well as the website link. We varied HIT construction across four categories to test how HIT organization and design affect work quality and cost. In HITs 1A and 1B, we gave workers a set of 21 items (18 yes/no and 3 open-ended) spanning four broad categories (general information, campus-specific information, information for individuals other than students, and diagnosis) and paid $1.50 for the task. In HITs 2A and 2B, we gave workers a set of 33 items that fit under a single category (services and treatment), including 30 yes/no and three open-ended questions, and paid $1.75 for the task. Finally, we varied the HITs between versions A and B, with the sole difference between versions being the addition of a paragraph in the B variants that told workers we would check accuracy and that users with too many inaccurate answers would not receive payment.

Study 3: Mental Health Websites -Results and Discussion
To evaluate worker accuracy, we compare results from the trained coders, which we take as a gold standard benchmark for accuracy, to results from MTurk workers. Three trained researchers first coded each of the 48 binary items for each of the three websites. The researchers agreed on 131 of the 144 total items across the three websites, and the remaining 13 items were checked again for accuracy. In contrast, MTurk workers correctly answered binary items at a rate of 63% for HIT 1A, 70% for HIT 1B, 78% for HIT 2A, and 82% for HIT 2B. Given the binary response choices, these rates are generally low. They do not improve when we use a consensus rule to aggregate MTurk responses to the same question: assuming an item's majority answer was correct would have resulted in errors for 31% of items. The accuracy difference between HIT 1A and HIT 1B is significant using an unpaired t-test (p<0.05), while the difference between HIT 2A and HIT 2B is not significant under the same test. The pooled difference between HITs 1 and HITs 2 is also statistically significant (p<0.001). Moreover, the pooled results show that individuals given the A variants were more likely to have a low accuracy rate than those seeing the B variants at a rate of 22% to 8%, respectively (p<0.05).
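The consensus rule discussed above can be made concrete with a short Python sketch that takes the majority answer per item and scores it against the gold standard. The data layout and function names are hypothetical, the toy responses are invented, and treating ties as errors is one possible design choice rather than the rule used in the study.

```python
from collections import Counter


def majority_answer(answers):
    """Most common answer for one item, or None when there is no clear winner."""
    ranked = Counter(answers).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the two leading answers
    return ranked[0][0]


def consensus_error_rate(responses, gold):
    """Share of gold-standard items whose majority answer is wrong or tied."""
    wrong = sum(
        1 for item, truth in gold.items()
        if majority_answer(responses.get(item, [])) != truth
    )
    return wrong / len(gold)


# Toy example: three yes/no items, each answered by several workers.
gold = {"q1": True, "q2": False, "q3": True}
responses = {
    "q1": [True, True, False],   # majority is correct
    "q2": [True, True, False],   # majority is wrong
    "q3": [True, False],         # tie, counted as an error here
}
print(consensus_error_rate(responses, gold))
```

Majority voting only repairs errors that are independent across workers. When most workers misread the same page in the same way, as item `q2` imitates, aggregation reproduces the error instead of cancelling it.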
In evaluating this case, we discovered an additional finding that pertains to best practices for MTurk data augmentation. Researchers might be tempted to proxy data quality with task completion time, discarding work completed in the shortest or longest amount of time, or both. However, we found little benefit from doing so. The correlation between accuracy and completion time is 0.34, and falls slightly (to 0.29) if we remove work completed in the bottom decile of completion times. If we remove work completed in the top decile, it increases (to 0.48). Removing both changes the correlation only marginally (to 0.44). On this basis, we conclude that completion time is a weak indicator of work quality. Some who complete the task quickly may simply be good at it, while some taking the longest amounts of time may have stepped away from the computer without sacrificing work quality. Recall that MTurk workers are paid by the task, not by completion time.
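The trimming exercise above is straightforward to reproduce on one's own HIT results. The sketch below assumes a Pearson correlation (the text does not name the coefficient) and uses invented (seconds, accuracy) pairs; `trim_by_time` and `time_accuracy_r` are hypothetical helper names.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def trim_by_time(records, drop_bottom=0.0, drop_top=0.0):
    """Drop the fastest and/or slowest fractions of (seconds, accuracy) records."""
    ordered = sorted(records)  # tuples sort by completion time first
    n = len(ordered)
    start = int(n * drop_bottom)
    stop = n - int(n * drop_top)
    return ordered[start:stop]


def time_accuracy_r(records):
    """Correlation between completion time and accuracy."""
    seconds, accuracy = zip(*records)
    return pearson(seconds, accuracy)


# Invented data: (completion time in seconds, share of items answered correctly).
records = [
    (60, 0.55), (90, 0.70), (120, 0.65), (150, 0.80), (180, 0.75),
    (240, 0.85), (300, 0.70), (420, 0.90), (600, 0.80), (1800, 0.75),
]
for label, kwargs in [
    ("all work", {}),
    ("without fastest decile", {"drop_bottom": 0.1}),
    ("without slowest decile", {"drop_top": 0.1}),
    ("without both", {"drop_bottom": 0.1, "drop_top": 0.1}),
]:
    print(label, round(time_accuracy_r(trim_by_time(records, **kwargs)), 2))
```

Running the four variants side by side shows how sensitive the coefficient is to a handful of extreme completion times, which is a reason to be cautious about using time as a filter for work quality.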
Overall, results from this case show that not all data augmentation tasks can be done effectively by online crowdsourcing workers. We focused on simple yes/no questions and received a 63% accuracy rate in one HIT iteration, only marginally better than random chance. However, we can draw other important conclusions about using MTurk for data augmentation from this case: alerting workers to the possibility of payment loss from sloppy work improves accuracy (consistent with Corrigan-Gibbs et al. 2015), as does the careful ordering of work into logical groups. Finally, researchers should be careful when evaluating work accuracy, as high error rates were maintained under consensus coding and showed little relationship to completion time.

DISCUSSION
The use of online crowdsourcing for survey and quasi-experimental research is gaining acceptance. A series of studies that compare the results of parallel surveys and experiments using MTurk and traditional methods have evaluated online crowdsourcing with generally positive assessments (Berinsky, Huber, and Lenz 2012a; Clifford, Jewell, and Waggoner 2015; Weinberg, Freese, and McElhattan 2014). Our content analysis of published social science papers that use MTurk indicated that such evaluations have generated a set of informal norms around design and reporting for experimental and survey-style MTurk studies.
We argued that online crowdsourcing as a data augmentation platform holds unique potential to add validity and value to big data at low cost, and our content analysis suggests that researchers are beginning to use it for these purposes. However, in contrast to the emergence of norms for experimental and survey research with online crowdsourcing platforms, we found little evidence of standards for the design and reporting of data augmentation with such tools. We addressed that gap in the literature by presenting a series of three case studies designed to consider specific big data augmentation challenges, test MTurk data augmentation against known benchmarks, and improve the research community's understanding of best practices of data augmentation through online crowdsourcing.
In this section, we consider the implications of both the content analysis and our three case studies in the context of past recommendations about online crowdsourcing. We aim to provide evidence-based guidance for two types of researchers: (1) those exploring the viability of online crowdsourced data augmentation for a project, and (2) those seeking to improve the validity and value of data augmentation efforts with online crowdsourcing. Finally, we hope that future researchers, reviewers, and editors will find these considerations valuable when evaluating data quality and reporting adequacy in online crowdsourcing studies, so we offer a model reporting template in the appendix in service of this purpose.

Strengths and Limitations of Using Online Crowdsourcing for Data Augmentation
Our three case studies test whether and when online crowdsourcing is practical for adding validity and value to big data projects. We found that data augmentation through online crowdsourcing platforms performs best in instances like case 1, where target data are clearly defined and standardized, but it is too time-consuming, challenging, or costly to automate information recovery or for trained coders to manually recover and evaluate this information. In such tasks, workers on online crowdsourcing platforms can find and code information quickly and efficiently. The results of case 2 suggest that researchers must consider the importance of the specific output data and likely return on investment before fielding HITs. While results in this case were accurate, most books lacked a match, reducing the effective value of data augmentation through online crowdsourcing. Nonetheless, were this case focused on a larger project with tens of thousands of missing records, for instance, substantial value could perhaps be gained. Case 3 looked at MTurk's potential for research beyond simple big data augmentation tasks, but it offers a more cautionary tale, wherein the non-specialized skills of online crowdsourcing workers and task completion incentives led to poor accuracy. While data augmentation through online crowdsourcing may not satisfy the complex needs of standard sociological studies such as the HMS-IWS, it can still save time and cost when used for smaller, more straightforward portions of the data collection process that would be necessary with big data augmentation.
To the extent that each of the following is true, we argue that using online crowdsourcing to augment big data should be considered more beneficial for potential cost and time savings:
1. Data collection cannot readily be automated.
2. Data can be found and/or coded by web-savvy persons without special training or knowledge.
3. Analytic needs for data are factual and do not include population estimates or comparisons with under-represented groups (minorities, individuals outside the US/India, older Americans, etc.).
4. Factual tasks can be split into smaller chunks without substantial duplication of effort.
5. Rapid results and the ability to test alternative instruments (e.g. pilot tests) are advantageous.

Best Practices for Academic Requesters
Given the broad range of goals, methods, and tools used by academic requesters, this section provides evidence-based guidance for maximizing the validity and value of big data augmentation using online crowdsourcing marketplaces. It assumes a researcher's goal is data augmentation, but it is also broadly applicable to surveys and experiments, with differences as noted. Once the decision has been made to use online crowdsourcing for data augmentation, a typical workflow includes three phases: design, collection, and analysis.
The design phase is most critical; it sets conditions for success in subsequent phases.
Clear visual design and precise, jargon-free instructions increase worker efficiency and lower the post-collection burden on requesters to manually check data quality. Based on experimental tests in cases 2 and 3, we recommend providing comprehensive instructions and examples, but highlighting (through size, color, placement, etc.) the most important instructions for task success, as well as how work will be evaluated in payment decisions.
Outside of tasks requiring extensive setup or training, delaying follow-up questions to later tasks or collecting data for a sample rather than every case poses little threat to data quality. The ease of redeployment and incremental expansion generally make it better to wait when it is unclear whether a researcher will need a specific piece of information, preparing follow-ups as necessary.
We refer to the splitting of work into smaller and more coherent tasks as related task grouping and argue that it improves work quality. Compared to initial single-shot versions of study 1, splitting the design into three HITs decreased cost and improved accuracy. Smart chunking lets workers self-select into tasks and not feel constrained to finish a longer task poorly to avoid sunk time. In both studies 1 and 2, a small proportion of the total number of workers completed most HITs, spending less time per HIT with at least equal accuracy. Related task grouping also avoids overpaying for work that is not completed. For example, a common application of big data augmentation through online crowdsourcing is asking workers to answer questions about a specific web link. If the link is invalid, any subsequent questions are inapplicable. If finding the initial links is also a goal, devoting a single task to identifying a suitable web address and asking subsequent workers to verify web address accuracy can save on excess pay while also providing cross-verification of the initial task's success.
Big data augmentation with online crowdsourcing is often swift and hands-off once HITs are posted, but some simple steps before, during, and immediately following HITs can improve data quality and requester reputation. Before activating a HIT, requesters can freely specify minimum worker qualifications, such as by only requesting workers with evidence of past task success or who have completed pre-tests (Leeper et al. 2015; Mason and Suri 2012 discuss tools for requesters more extensively). Requesters should also monitor their registered email during and immediately following HIT batches, as workers may contact them when they are unsure about the appropriate response, to report unclear directions or glitches, and to appeal rejections. Many circumstances, including browser malfunction, accidental user error, or common mistakes can result in rejection of ambiguous or good work, so researchers often accept all complete HITs and later remove poor quality data.
Of the phases of online crowdsourcing implementation, scholars have paid the least attention to analysis and reporting. The variety of big data, their relative lack of structure, and the priority of computer science and engineering over the social sciences in the field have contributed to inconsistent reporting. For data augmentation with online crowdsourcing tools to increase the validity and value of big data, transparency is imperative as to the procedure used to collect the data, how their integrity was verified, and relevant information on workers.
We provide a recommended reporting template in the appendix with both standard items that should be included in reporting all online crowdsourcing studies and items to use in reporting specifically for big data augmentation. We recommend researchers report on key study features, its purpose and implementation, and the exact criteria that they used to determine data quality, including at least one of several potential validity checks. Whenever possible, we suggest that both instruments and output data should be made available through public data repositories, such as the Dataverse network (www.dataverse.org) or other publicly accessible sites, such as GitHub repositories. In either case, standard confidentiality practices should be observed in removing unique worker numbers and other personal identifiers before publishing data, and researchers must adhere to relevant human subjects research guidelines when appropriate.
Worker compensation is a final issue that deserves discussion. Typical worker compensation among the few academic studies that report hourly pay on MTurk is $1-2 per hour, rates that prior work suggests produce reliable results (Buhrmester, Kwang, and Gosling 2011). These rates, however, are far below U.S. minimum wages and legal only because MTurk workers are self-employed contractors not subject to minimum wage laws. Buhrmester and colleagues (2011) found that compensation was not the most commonly cited motivation for workers, but recent findings suggest many workers rely on MTurk as primary or supplemental income (Hitlin 2016; L. Irani and Silberman 2014; Litman, Robinson, and Rosenzweig 2015). We worry that such low payment rates can damage the broader research community by hurting the reputation of academic researchers. A 2014 experiment (Benson, Sojourner, and Umyarov 2015) estimated that HITs from requesters with good reputations in the online review forum Turkopticon recruit workers at twice the rate of those with poor reputations (Silberman 2015; L. C. Irani and Silberman 2013). We encourage researchers who wish to estimate costs to collect a small pilot study and target average hourly compensation of at least the U.S. federal minimum wage (currently $7.25).
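As a rough illustration of the pilot-based costing suggested above, the following sketch converts pilot completion times into a per-HIT payment that meets an hourly target. The function names and pilot times are hypothetical, the calculation uses the mean completion time as one reasonable reading of "average hourly compensation", and it excludes any platform fees.

```python
import math
from statistics import mean

FEDERAL_MINIMUM_WAGE = 7.25  # USD per hour, the figure cited in the text


def pay_per_hit(pilot_seconds, hourly_target=FEDERAL_MINIMUM_WAGE):
    """Per-HIT payment in USD, rounded up to the next cent, so that a worker
    taking the pilot's mean completion time earns at least the hourly target."""
    mean_hours = mean(pilot_seconds) / 3600
    return math.ceil(mean_hours * hourly_target * 100) / 100


def effective_hourly_rate(payment, pilot_seconds):
    """Hourly rate implied by a per-HIT payment at the pilot's mean completion time."""
    return payment / (mean(pilot_seconds) / 3600)


# Hypothetical pilot: completion times in seconds for three pilot HITs.
pilot_seconds = [240, 300, 360]
payment = pay_per_hit(pilot_seconds)
print(f"pay per HIT: ${payment:.2f}")
print(f"implied hourly rate: ${effective_hourly_rate(payment, pilot_seconds):.2f}")
```

Rounding up rather than to the nearest cent keeps the implied hourly rate at or above the target; a requester who prefers to protect slower workers could substitute a higher percentile of the pilot times for the mean.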

Conclusion
This paper offers data augmentation through online crowdsourcing as a means to address common concerns regarding big data in the social sciences, because doing so can add validity and value at low cost to researchers. Whereas prior work has focused on the generalizability and ethics of big data, issues of validity and value have received considerably less attention. At the same time, while many have used online crowdsourcing marketplaces such as MTurk for drawing samples, or for experimental studies, few researchers have used them for data augmentation. In this paper, we attempted to bridge these literatures. We reviewed existing practices in academic research using online crowdsourcing and considered three empirical cases where big data augmentation through crowdsourcing enhanced ongoing research or illustrated the limits of data augmentation with such tools. Based on these analyses, we provided general guidance and best practices for academic research that uses online crowdsourcing for data augmentation and a standardized reporting framework. Although we emphasized the use of online crowdsourcing for big data augmentation, many of our findings and recommendations may be of value to researchers considering online crowdsourced labor for other tasks like fielding surveys. There is substantial promise in using online crowdsourcing to free up research assistant time without the need for highly-skilled programmers, and this paper offers some first steps to formalize knowledge about the potential for using these tools.

* Unless identical across batches, items should be reported for each batch of data collected using MTurk
+ We recommend these items be included in reporting table as the URL of an online repository

Figure 1. Numbers of articles with topic "Big Data" overall and matching select fields, 2009-2016

address three questions in this content analysis: a) who uses MTurk for academic purposes, b) what is it used for, and c) what details are reported about the use of the platform.
Gosling 2011; Paolacci, Chandler, and Ipeirotis 2010; Berinsky, Huber, and Lenz 2012) and revised them according to common worker concerns voiced in online MTurk forums (e.g., http://www.turkernation.com) and our own piloting. We collected all data between October 2015 and July 2016. The online supplement provides full versions of instruments and de-identified results.

Study 1: Academic Affiliation - Overview and Methods
Our first case shows how MTurk can enhance the validity of big data. It is part of a larger project on the role of interdisciplinary dissertation committees in knowledge production. The original project used an algorithm to code the academic field of faculty based on their roles in doctoral committees. For instance, if a faculty member chaired committees in one field and was a member of committees in another, the algorithm assigned them to the field in which they chaired. Most cases were less clear cut, however, and required more complex assignment rules reviewed in greater depth in the original paper. Such algorithmic assignment indicated a surprising amount (56%) of interdisciplinary dissertation committees. The credence given to these prevalence statistics, however, hinges on the accuracy of the automated coding. This represents a classic concern voiced by social science skeptics about automated augmentation of big data. For instance, compare the critique of sentiment analysis in the aforementioned Facebook experiment (Panger 2015; Kramer, Guillory, and Hancock 2014) or concerns about search term inclusion in Google Flu (Lazer et al. 2014; Ginsberg et al. 2009). Manually verifying a sample (manual data augmentation) represents one way to check result accuracy; however, our tests indicated that finding and hand coding the fields of a sample of 2,000 of the 66,901 faculty (3%) would have demanded over 230 hours of trained coder work. This time commitment translates to more than three quarters of a semester of typical graduate research assistant support, assuming a 15 week semester at 20 hours a week.

Figure 2: Experimental Variant 1 for Study 2 (complete)

Formative pilot studies can help to identify problems with design. If using external tools, such as pairing MTurk with survey administration platforms, it is vital to pretest HITs and ensure the correct operation of validation codes that verify external task completion. Malfunctioning codes are a common complaint on worker forums, as workers who have invested as much as an hour in a survey are unable to receive compensation. We recommend pre-testing all HITs on the requester sandbox (http://requestersandbox.mturk.com) and testing codes as part of this process.
Clear design for search or evaluation tasks faces the additional challenge of user customization and personalization. Major internet search engines often customize results based on user location and past search history. Requesters seeking to collect data that are comparable across cases should minimize variability by embedding custom search links in the directions, using non-personalized search engines such as DuckDuckGo, as we did in case study 1, and specifying how many results to use (e.g. the first 20; K. Lewis 2015 also makes this point explicitly for other big data purposes). Search links can contain elements from the input that vary between cases, embed Boolean logic, and restrict results to specific domains.
Cases 1 and 3 demonstrated two additional principles specific to data augmentation and other factual HITs: a) iterative data collection, and b) related task grouping. Iterative data collection favors rapid and efficient collection of a limited range of data over single-shot data collections designed to answer numerous questions. With large online crowdsourcing marketplaces, a sizable labor force is always available, and researchers can easily integrate prior task output into subsequent input. Outside of tasks requiring extensive setup or training, delaying

assigned to complete each HIT (e.g. provided identical input)
Date(s) | The date(s) and time period during which the batch was collected
Instrument(s) + | HTML, complete description, or screen capture of instrument(s) for tasks exactly as implemented
Source of input data | What defines cases in the input file and where the data are originally derived from
Output variables | Descriptive statistics for output variables used in analysis (including missing patterns and worker demography if applicable)
Qualifications | List of requirements for workers to accept HITs (standard or custom)
Rejection criteria | Description of how decision was made to approve or reject assignments
Rejection rate | Proportion of submitted assignments that were rejected
Validation check(s) | At least one additional procedure (other than qualifications or rejection criteria) to verify data quality. Such procedures include:
• Consistency between multiple workers on the same HIT (inter-rater reliability)
• Accurate completion of items with known correct answers included in HIT
• Worker attention checks (questions with obvious correct answers to ensure workers are reading questions and following directions)
• Confirmation in later sequential HITs
• Consistency with another method (e.g. automated coding or trained coders)

Recommended whenever applicable
Item | Description
Third-party tools | Name and version number (or date, if non-versioned) of any third party tools such as Qualtrics or SurveyMonkey used to administer HITs externally
Design features | Precise description of any contingency, experimental, or quasi-experimental design that is not clear from the instrument (often requires third-party tool)
Sampling methodology | Information on any sampling process, including the population being sampled, how cases were selected for inclusion, and whether the sample is with replacement
Weights | List of any weight or adjustment variables and their derivation
Panel attrition | Standard panel attrition statistics for longitudinal data collection
Repeat worker rate | For surveys, experiments, and other tasks collecting information about workers, the proportion of HITs completed by workers who had already completed one or more HITs in the study
Repeat worker consistency | For tasks collecting information about workers, the proportion of demographic responses consistent between HITs by the same worker
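The custom search links recommended in the design guidance above (links that vary with the input, combine terms, and restrict results to a domain) could be generated along the following lines. This is a hedged sketch rather than the procedure used in the case studies: the function name, the example inputs, and the choice of quoted phrases plus a `site:` restriction are assumptions, and operator support differs between search engines.

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://duckduckgo.com/?"  # non-personalized engine named in the text


def build_search_link(name, institution, site=None):
    """Build a search URL for one input case.

    Quoted phrases ask the engine for exact matches; the optional
    `site:` term restricts results to a single domain.
    """
    terms = f'"{name}" "{institution}"'
    if site:
        terms += f" site:{site}"
    # urlencode percent-encodes quotes and colons and turns spaces into "+"
    return SEARCH_BASE + urlencode({"q": terms})


# Hypothetical input rows; in practice these would come from the HIT input file.
cases = [
    {"name": "Jane Doe", "institution": "Example University", "site": "example.edu"},
    {"name": "John Roe", "institution": "Sample College", "site": None},
]
for case in cases:
    print(build_search_link(case["name"], case["institution"], case["site"]))
```

Generating the link once per input row and embedding it in the HIT directions means every worker assigned to a case starts from the same query, which is the point of the recommendation.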

Table 1. Worker tasks in 100 Articles Matching Topic "Mechanical Turk" in Web of Science
Notes: Many studies ask workers to complete multiple tasks, so major categories percentages do not add to 100%. *Two-tailed F test between psychology and other fields significant (p<.001).

through 2015, the last full year in our data. In general, these articles are cited frequently, with Web of Science's citation counts indicating an average of 16 citations (median 8) for articles at least two years post-publication. These levels compare favorably to general article citation counts across many fields, where citation counts often average one per year or less (Thompson Reuters 2010).
We are also interested in what researchers use MTurk for, specifically how often it is used for data augmentation. Table 1 reports on the types of tasks academic researchers assign to MTurk workers. Because of psychology's disproportionate use of MTurk, we disaggregate results by whether the article was in a psychological field. Most papers used MTurk to field surveys (64%), but data augmentation comprised the second most common category (59%). In
Another question of interest is how academic researchers report on their use of MTurk as a data augmentation platform. Although researchers use MTurk for data augmentation, we found gaps in reporting standards that may impair the validity, value and replicability of MTurk as a data augmentation tool. Nearly every article we examined (92%) described data collection procedures like HIT content in detail, and most (80%) included at least basic summaries of worker demographics. However, few articles we examined reported required worker qualifications, criteria for work rejection, or validation criteria. Only 16% met what we define as basic reporting standards across three key areas for peer evaluation and replicability: a) a detailed description of the HITs and process, b) information on worker qualifications, acceptance criteria and pay, and c) descriptive statistics, multivariate analysis, or formal validity checks.