Do truth-telling oaths improve honesty in crowd-working?

  • Nicolas Jacquemet,

    Roles Conceptualization

    Affiliation Paris School of Economics and Université Paris 1 Panthéon-Sorbonne, CES, Paris, France

  • Alexander G. James,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Writing – original draft

    alex.james@uaa.alaska.edu

    Affiliation Department of Economics, University of Alaska Anchorage, Anchorage, Alaska, United States of America

  • Stéphane Luchini,

    Roles Conceptualization

    Affiliation Aix-Marseille University, CNRS, EHESS, Centrale Marseille, Aix-Marseille School of Economics, Marseille, France

  • James J. Murphy,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Writing – original draft

    Affiliations Department of Economics, University of Alaska Anchorage, Anchorage, Alaska, United States of America, Economic Science Institute, Chapman University, Orange, California, United States of America

  • Jason F. Shogren

    Roles Conceptualization, Supervision, Writing – original draft

    Affiliation Department of Economics, University of Wyoming, Laramie, Wyoming, United States of America

PLOS

Abstract

This study explores whether an oath to honesty can reduce both shirking and lying among crowd-sourced internet workers. Using a classic coin-flip experiment, we first confirm that a substantial majority of Mechanical Turk workers both shirk and lie when reporting the number of heads flipped. We then demonstrate that lying can be reduced by first asking each worker to swear voluntarily on his or her honor to tell the truth in subsequent economic decisions. Even in this online, purely anonymous environment, the oath significantly reduced the percent of subjects telling “big” lies (by roughly 27%), but did not affect shirking. We also explore whether a truth-telling oath can be used as a screening device if implemented after decisions have been made. Conditional on flipping response, MTurk shirkers and workers who lied were significantly less likely to agree to an ex-post honesty oath. Our results suggest oaths may help elicit more truthful behavior, even in online crowd-sourced environments.

1 Introduction

Online labor markets have become increasingly popular. In social science research, for instance, crowd-work platforms like Amazon’s Mechanical Turk (MTurk) offer important advantages over the typical university student subject pool, including low cost, speed of data collection, and access to a more heterogeneous pool of participants [1–3], although this pool is far from representative of the general population [4, 5]. Despite a more serious attrition problem [6], a large body of evidence shows consistency in behavior between MTurk workers and student subjects in multiple disciplines, including behavioral economics [7–9], psychology [10, 11], sociology [12], accounting [13], advertising [14] and political science [15]. Based on a large replication study, [16] hypothesize that this robustness might be related to the homogeneity of treatment effects measured in experiments in the social sciences.

Nevertheless, there are concerns that the unique characteristics of online labor markets increase the potential for dishonest and unethical behavior. The workforce is anonymous and transient [17] and is typically unmonitored [18]. Because workers operate remotely in uncontrolled settings, online work can weaken social ties with employers [19] and has the potential for distractions such as cell phones [20] and multi-tasking [21]. Multiple studies have documented the prevalence of dishonest behavior on MTurk. This can manifest itself in multiple ways [see, e.g., 22, for a survey] including misrepresenting whether a worker meets the eligibility criteria for participating in a task [23], rushing through the task so quickly that it is not possible to properly perform the task [24] or shirking by not paying attention [25].

Our study provides direct evidence on dishonesty in online labor markets using a variation of the coin-tossing game [26–28]: we asked MTurk workers to flip a coin 10 times, offering an additional 10 cents for each head flipped. This design allows us to measure lying at the aggregate level, by comparing the distribution of reported outcomes to the theoretical truthful distribution. We complement this aggregate measure of lying with an individual measure of shirking, defined as answering the survey without performing the coin-tossing task: we combine observed response times and external coin-tossing data to classify workers as shirkers whenever their response was too quick to allow them to perform the required task. Consistent with other studies, our results show that dishonesty, in the form of both shirking and lying, is prevalent on MTurk: workers do lie, although not fully. Workers reported an average of 6.33 heads, which is significantly different from the expected mean of 5 if all workers were truthful, but is also much lower than the mean of 10 if all lied maximally. Shirking is widespread: nearly half (42.6%) of workers completed the coin-flip task at a speed that was physically impossible.

The open question we address herein is whether a non-financial honesty oath reduces dishonesty in crowd-working relationships. The solemn oath to honesty is an ancient and time-tested mechanism designed to eliminate misbehavior by asking a person to commit to the truth [29–32]. In laboratory experiments, the oath has been shown to affect behavior in multiple contexts, including reducing hypothetical bias in non-market valuation [33–35], improving coordination in a strategic game with cheap talk [36] and increasing compliance in tax evasion games [37]. Both [38] and [39] directly test the effect of an oath on truth-telling in a lab setting with European university students, and show that the oath significantly reduced lying in the lab. We provide evidence on the effectiveness of a freely taken truth-telling oath in the field by assigning half the workers to an Oath treatment in which, before the coin-flipping task, they are offered the possibility of taking a voluntary solemn oath to honesty.

Our results are threefold. (Unless stated otherwise, the p-values reported in the text are from one-tailed t-tests.) First, while the oath slightly reduced shirking (from 42% to 40%), this change was not significant (p = 0.170). Second, the oath did, however, lead respondents to spend almost 30 seconds more completing the survey (p = 0.002), which suggests that the oath may have induced them to be more thoughtful and accurate in their responses. Third, the oath caused a modest (4.2%) reduction in the average number of heads reported (6.06 vs. 6.33 in the baseline treatment with no oath, p = 0.008) and a large (27%) reduction in the frequency of payoff-maximizing reports (12.9% vs. 17.8%, p = 0.006). The oath thus led workers to answer survey questions more thoughtfully and truthfully, and could be an effective and practical tool to elicit more accurate survey data.

Finally, we investigate whether a voluntary truth-telling oath can be used as a screening device to separate truthful answers from dishonest ones. To that end, we implemented an ex-post truth-telling oath after the coin-flipping task in the NoOath treatment. Our results show that both MTurk workers who reported flipping a large number of heads and those who did not carry out the coin-flipping task were less likely to agree to the ex-post oath. Some workers voluntarily self-reported dishonest behavior when they were unexpectedly asked to take an oath attesting to the veracity of their completed work. Together, these methods can improve honesty in online labor markets and data quality in online experiments.

2 Material and methods

Our empirical evidence comes from an online version of the coin-tossing game introduced by [26]. Our main treatment variable is a truth-telling oath adapted from [33]. The main outcomes of interest are the distribution of heads reported, and response times.

2.1 Survey implementation

The experiment was administered on Amazon’s Mechanical Turk, which is an online platform that connects employers (or “requesters”) with potential workers. The tasks (called “human intelligence tasks”, HITs) that MTurk workers (“MTurkers”) complete are typically simple and straightforward (e.g., answering a questionnaire) and can be completed privately and anonymously at any location. This platform has several advantages for the purpose of our study.

The first is representativeness. MTurkers operate in a naturally-occurring labor market and search for tasks with the goal of earning money. Our subjects answered a call soliciting participants. This stands in contrast with studies using a phone survey [e.g., 40] whose participants were not actively seeking opportunities to participate. MTurkers also tend to be more representative of the US population than in-person samples such as lab experiments [15].

The second is anonymity. MTurkers are identified only by a user ID that cannot be linked to any personally identifying information. Participants in lab studies, by contrast, are known to the experimenter even if decisions cannot be linked to an individual. Experimenters also usually have direct, face-to-face interaction with subjects during check-in, the reading of instructions and payment. While participants in phone surveys are typically anonymous, the researcher interacts directly with the subject during all stages, including decision-making. Privacy and anonymity are maintained, but subjects do not necessarily perceive them correctly; for instance, they may think the caller has access to their name and address through their phone information. The relationship between MTurkers and employers is fundamentally different because there is no direct personal contact. Employers post an ad describing the task, and if the worker accepts, the task begins. Compensation is paid once the worker completes the task and the employer accepts it.

Third, because all interactions are highly impersonal, MTurk weakens the connection between the researcher and the worker compared to other implementations. This reduces the risk of demand effects, i.e., the desire of subjects to comply with the experimenter’s request rather than provide their own preferred answer [41]. Fourth, for this same reason, lying behavior on MTurk is interesting in its own right, as impersonal market institutions like online labor markets have been hypothesized to foster self-interested, non-cooperative behavior [42]. Lying to increase one’s earnings would be consistent with this hypothesis.

2.2 Content of the survey

The experiment was advertised on MTurk and MTurkers could only participate once. The only restriction on participation was to be at least 18 years old. After agreeing to the task, MTurkers were redirected to a Qualtrics survey. The first page thanked them for participating, and elicited informed consent to participate in an academic study. Respondents were informed they would be paid $1.00 for completing the survey, would have an opportunity to earn additional money during the survey, and that all of the information obtained would be kept anonymous and confidential.

Coin flipping task.

The first screen of the survey reads “Flip a coin 10 times. You will be paid a bonus of 10 cents per head that is flipped. How many heads did you flip?”. While the monetary reward for lying is small, note that the maximum bonus of $1.00 amounts to roughly 30 minutes of work for a typical MTurker: according to [43], the median wage on MTurk is just $2.00 per hour. MTurkers were then provided a drop-down menu where they selected their answer from the integers between 0 and 10. Privacy and anonymity, combined with the bonus payment of 10 cents per reported head, create an incentive for MTurkers to report a larger number of heads than they actually observed, since there are no financial or social consequences for lying.
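The stakes can be made concrete with a little arithmetic. The sketch below uses only the figures quoted above (10 cents per reported head, 10 flips, and the $2.00 median hourly wage from [43]). The constant and function names are illustrative, and the expected truthful bonus assumes a fair coin.

```python
# Stakes of the coin-flip task, using the figures quoted in the text.
BONUS_PER_HEAD = 0.10   # dollars paid per reported head
N_FLIPS = 10            # number of coin flips requested
MEDIAN_WAGE = 2.00      # median MTurk wage in dollars per hour, as reported in [43]


def minutes_of_work(dollars: float, hourly_wage: float = MEDIAN_WAGE) -> float:
    """Convert a dollar amount into minutes of work at the given hourly wage."""
    return 60.0 * dollars / hourly_wage


max_bonus = BONUS_PER_HEAD * N_FLIPS                      # reporting 10 heads
expected_truthful_bonus = BONUS_PER_HEAD * N_FLIPS * 0.5  # a fair coin yields 5 heads on average
gain_from_max_lie = max_bonus - expected_truthful_bonus   # extra pay from always reporting 10

print(f"maximum bonus: ${max_bonus:.2f} ({minutes_of_work(max_bonus):.0f} minutes of work)")
print(f"expected extra pay from a maximal lie: ${gain_from_max_lie:.2f} "
      f"({minutes_of_work(gain_from_max_lie):.0f} minutes of work)")
```

Under these assumptions, the maximum bonus corresponds to 30 minutes of work at the median wage. A maximal lie earns, on average, about 15 minutes' worth more than truthful reporting would.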

Demographic survey.

After the coin-flipping task, respondents completed a survey which contained demographic questions and a subset of questions from the World Values Survey.

Consistency check.

A common concern with Mechanical Turk is that workers are not thinking carefully about the questions being asked, or that automated programs (“bots”) designed to mimic human behavior contaminate the results. To address this concern, we follow [11] and add the following question about halfway through the survey: “We want to make sure you are not a robot. What is the number two plus the number three equal to?”. MTurkers who gave an incorrect answer to this question are dropped from the analysis.

2.3 Treatment variables

Ex-ante oath.

Respondents were randomly assigned to one of two treatments, Oath and NoOath. To avoid possible temporal bias, we released a batch of 200 HITs every two hours until the budget was exhausted. Upon agreeing to participate in the study, MTurkers in the Oath treatment were shown a screen that reads “Before we begin, do you swear upon your honor to answer the following questions truthfully? (You will be allowed to continue with this survey regardless of your answer to this question)”. In contrast with previous studies of the oath [e.g., 37, 38], MTurkers were not asked to sign the oath because of the online implementation of the procedure. Rather, they were simply asked to click “yes” or “no”. While agreeing to the oath was optional, only two MTurkers did not agree to it. Note that the oath was administered before MTurkers knew anything about the task at hand or the financial opportunity cost associated with being honest. Regardless of how MTurkers answered the oath question, they moved on to the next screen, which is the coin-flipping task described above. The NoOath MTurkers were not presented with the oath screen, and instead went directly to the coin-flipping task. The content of the survey was otherwise exactly the same in both treatments.

Ex post oath.

Immediately following the coin-flipping stage, MTurkers in the NoOath treatment were exposed to an ex-post oath that reads “Do you swear upon your honor that the number of heads you reported flipping is truthful? (You will be paid according to the number of heads you reported flipping regardless of your answer to this question).” MTurkers in the Oath treatment were instead asked “Did swearing upon your honor to tell the truth affect the number of heads you reported flipping?”.
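The following minimal sketch makes the screen order in the two treatments explicit. The question texts are quoted from this section. The function names and the use of Python's `random` module for assignment are hypothetical. The consent, demographic and consistency-check pages are omitted for simplicity. This is an illustration, not the authors' Qualtrics implementation.

```python
import random

# Question texts quoted from the survey description.
EX_ANTE_OATH = ("Before we begin, do you swear upon your honor to answer the following "
                "questions truthfully? (You will be allowed to continue with this survey "
                "regardless of your answer to this question)")
COIN_TASK = ("Flip a coin 10 times. You will be paid a bonus of 10 cents per head that "
             "is flipped. How many heads did you flip?")
EX_POST_OATH = ("Do you swear upon your honor that the number of heads you reported flipping "
                "is truthful? (You will be paid according to the number of heads you reported "
                "flipping regardless of your answer to this question).")
OATH_FOLLOW_UP = ("Did swearing upon your honor to tell the truth affect the number of heads "
                  "you reported flipping?")


def assign_treatment(rng: random.Random) -> str:
    """Hypothetical random assignment to one of the two treatments."""
    return rng.choice(["Oath", "NoOath"])


def screen_sequence(treatment: str) -> list[str]:
    """Screens shown around the coin-flipping task (other survey pages omitted)."""
    if treatment == "Oath":
        # Ex-ante oath, then the task, then the question about the oath's effect.
        return [EX_ANTE_OATH, COIN_TASK, OATH_FOLLOW_UP]
    if treatment == "NoOath":
        # No ex-ante oath; the ex-post oath follows the task immediately.
        return [COIN_TASK, EX_POST_OATH]
    raise ValueError(f"unknown treatment: {treatment}")


for screen in screen_sequence(assign_treatment(random.Random(2020))):
    print(screen[:40], "...")
```

The main design point is visible in the two branches. Only the Oath group sees an oath before the task. Only the NoOath group is asked to swear afterwards, which is what allows the ex-post oath to serve as a screening device.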

2.4 Measures: Lying and shirking

Lying.

We define lying as intentionally making a false statement, which in this context means that an MTurker misreported the actual number of heads observed after flipping the coin. A well-known feature of coin-flip experiments is that lying cannot be observed at the individual level, since all decisions are made in private. Dishonesty can only be measured by comparing the aggregate outcomes to the truthful distribution, which requires a large enough sample for the empirical distribution of draws to be close to the theoretical one. Participants are asked to perform 10 independent flips of a fair coin and to report the number of heads, i.e., the sum of their draws. Under truthful reporting, this number follows a binomial distribution with 10 trials and success probability 1/2. Its expected value is 5 and its variance is 10 × 1/4 = 2.5 (each single flip has variance 1/4), and by the central limit theorem its distribution is approximately normal.
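As a check on the truthful benchmark, the number of heads reported by a truthful worker follows a binomial distribution with n = 10 trials and success probability 1/2. A single fair flip has variance 1/4, so the count of heads over ten independent flips has mean 5 and variance 10 × 1/4 = 2.5. The sketch below computes these quantities directly from the binomial probabilities, using only Python's standard library. The variable names are illustrative.

```python
from math import comb

N_FLIPS = 10   # flips per worker
P_HEAD = 0.5   # fair coin


def truthful_pmf(k: int, n: int = N_FLIPS, p: float = P_HEAD) -> float:
    """Binomial probability of observing exactly k heads in n independent flips."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


pmf = [truthful_pmf(k) for k in range(N_FLIPS + 1)]
mean = sum(k * q for k, q in enumerate(pmf))
variance = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))

print(f"mean = {mean:.2f}, variance = {variance:.2f}")
print(f"P(10 heads) = {pmf[10]:.5f}")  # 1/1024, i.e. about 0.1%
```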

Shirking.

We define shirking as the failure to perform the agreed-upon task, i.e., not flipping the coin 10 times as instructed. While we do not observe respondents’ behavior during the survey, some (but not all) shirking can be detected at the individual level based on the amount of time an individual spent on the coin-flipping part of the survey. This response time is measured by a feature embedded in the Qualtrics survey that records how long an individual spent on each page, i.e., the time in seconds elapsed between the moment the page is displayed and the moment the next page appears. This provides a reliable measure of the time spent on the task, since MTurkers were required to answer each question to proceed to the next one and were not allowed to go back and forth in the survey. Also note that, following standard practice, the survey does not mention the measurement of response times, which minimizes the risk that respondents manipulate the time they spend on the survey to pretend they performed the task.

To determine the minimum amount of time needed to complete this task, we asked 28 students in a large university class to flip a coin, which had been provided to them, 10 times as quickly as possible. They then counted the number of heads and entered the result online in the same way MTurkers in the experiment reported their answers. The fastest that any student completed the coin-flipping task was 27 seconds, with a mean of 102 seconds. Based on this, we concluded that it was impossible to complete the task in less than 30 seconds. This is a conservative estimate: in the classroom pilot, students already had a coin available and were prepared to flip before the timer started, whereas for the MTurkers, flipping time also included time spent getting a coin.

One may be concerned that subjects did not have access to a coin while answering the survey. First, note that subjects could also have “flipped” an online coin using, e.g., Random.org or a randomizer app on their smartphone. Such an alternative procedure is unlikely to save time, as it requires three time-consuming steps. First, the user enters an appropriate URL into the search bar (or picks up a mobile phone, unlocks it and opens the app). Second, the user decides how many times the coin should be flipped. Third, the user must count the number of heads displayed, then enter the result into the Qualtrics survey. Even assuming that the MTurker already knew of a coin-flipping website or had a randomizer app installed on a phone, it is still highly unlikely that she would have been able to complete the task within 30 seconds. Second, we surveyed out-of-sample MTurkers and asked them, “Do you have a coin within reach?” Conditional on not having a coin within reach, we then asked them, “Could you get a coin within thirty seconds?” Out of 454 responses, 335 (73.6%) reported having a coin in reach and 415 (91.2%) reported either having a coin in reach or being able to get one in less than thirty seconds.

Based on this threshold, we can identify those MTurkers who almost certainly did not complete the task, but we cannot identify those who certainly did complete it. We define a “quick” response as one that was completed in less than 30 seconds, and label those workers as “shirkers”. By contrast, a response that was completed in at least 30 seconds is defined as “slow”. Because the task was done in private, we have no way of knowing whether a “slow” MTurker actually performed the task; our measure based on quick responses thus provides a lower bound on shirking in the task.
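The classification rule above is simple enough to state in code. This is a minimal sketch. The `Response` record, its field names and the example responses are hypothetical; only the 30-second cut-off and the strict "less than 30 seconds" rule come from the text.

```python
from dataclasses import dataclass

SHIRK_THRESHOLD_S = 30.0  # seconds; the conservative cut-off derived from the classroom pilot


@dataclass
class Response:
    worker_id: str        # hypothetical anonymous identifier
    flip_time_s: float    # seconds spent on the coin-flipping page
    heads_reported: int   # number of heads the worker reported


def is_shirker(r: Response, threshold: float = SHIRK_THRESHOLD_S) -> bool:
    """A 'quick' response: completed in strictly less than the threshold."""
    return r.flip_time_s < threshold


def shirking_lower_bound(responses: list[Response]) -> float:
    """Share of quick responses. Slow responses may still hide shirkers, so this is a lower bound."""
    if not responses:
        raise ValueError("no responses to classify")
    return sum(is_shirker(r) for r in responses) / len(responses)


# Made-up responses for illustration only.
example = [
    Response("a", 12.0, 10),  # quick: shirker
    Response("b", 45.5, 6),   # slow
    Response("c", 29.9, 5),   # quick: shirker
    Response("d", 30.0, 4),   # exactly 30 s counts as slow
]
print(f"lower bound on shirking: {shirking_lower_bound(example):.0%}")
```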

2.5 Data

We collected data from 1,410 MTurkers. Of these, we dropped the 43 (3%) MTurkers who failed to correctly answer the consistency check question (the sum of 2 + 3). In addition, one MTurker who spent 1,700 seconds answering the coin-flipping question was dropped to minimize outlier bias when we examine flipping times. This leaves 1,366 observations (681 in Oath and 685 in NoOath). Table 1 provides summary statistics for both treatments. Across the Oath and NoOath treatments, MTurkers were predominantly male (around 60%), white (63%) and physically located in the USA (82%). The average age was 35 (with a standard deviation of 10.7). Across all characteristics, MTurkers in the Oath and NoOath treatments were similar.
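The sample-construction steps can be sketched as two successive filters. The record layout and the `max_flip_time_s` cut-off are assumptions for illustration. The text describes dropping one specific worker who spent 1,700 seconds, not a general threshold rule. The example records are made up.

```python
from collections import Counter
from dataclasses import dataclass

CORRECT_CHECK_ANSWER = 5  # "What is the number two plus the number three equal to?"


@dataclass
class RawRecord:
    treatment: str        # "Oath" or "NoOath"
    check_answer: int     # answer given to the consistency-check question
    flip_time_s: float    # seconds spent on the coin-flipping page


def clean_sample(records: list[RawRecord], max_flip_time_s: float = 1_700.0) -> list[RawRecord]:
    """Drop failed consistency checks, then drop extreme flipping times.

    max_flip_time_s is a hypothetical cut-off standing in for the single
    outlier removed in the text.
    """
    passed = [r for r in records if r.check_answer == CORRECT_CHECK_ANSWER]
    return [r for r in passed if r.flip_time_s < max_flip_time_s]


def counts_by_treatment(records: list[RawRecord]) -> Counter:
    """Number of retained observations in each treatment."""
    return Counter(r.treatment for r in records)


# Made-up records for illustration only.
raw = [
    RawRecord("Oath", 5, 42.0),      # kept
    RawRecord("NoOath", 4, 35.0),    # failed the consistency check
    RawRecord("NoOath", 5, 1700.0),  # dropped as an outlier
    RawRecord("NoOath", 5, 18.0),    # kept (a quick response, but still in the sample)
]
clean = clean_sample(raw)
print(counts_by_treatment(clean))
```

Applied to the article's counts, the same bookkeeping gives 1,410 − 43 − 1 = 1,366 retained observations, which matches the 681 + 685 split across treatments.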

3 Results

3.1 Do MTurkers shirk and lie?

We first use the NoOath treatment as a baseline to address whether MTurkers shirk and/or lie. Fig 1 shows the distribution of flipping time by treatment (for display purposes, the figure omits those MTurkers who took more than 200 seconds). The vertical line displays the 30s threshold that distinguishes quick from slow responses. The data clearly indicate that a nontrivial number of MTurkers did not flip the coin as instructed, i.e., they shirked: 42.6% (N = 292) of MTurkers completed the task in less than 30 seconds. This is comparable to the share of inattentive MTurkers (42%) documented by [25].

Fig 1. Flipping time distributions by treatment.

Note. Panels (a) and (b) display the empirical distribution of flipping times, by treatment. A red vertical line is drawn at 30 seconds—the threshold defining shirkers. 35 workers who spent more than 200 seconds on the coin-flipping question were dropped to construct these figures. Panel (c) reports the QQ-plot of the quantiles of the NoOath flipping time distribution (on the x-axis) against the Oath one (on the y-axis).

https://doi.org/10.1371/journal.pone.0244958.g001

Table 2 provides evidence on lying behavior based on the distribution of reported flips. Overall, MTurkers reported an average of 6.33 heads, and we reject the null hypothesis that this is less than or equal to the expected mean of five if all reporting were truthful (p < 0.001). Fig 2 displays a more detailed comparison between reported outcomes and the truthful distribution. As shown in Panel (a), the modal response (N = 298, 21.8%) was six (a small lie if reported dishonestly), and 18% of MTurkers (N = 122) reported flipping 10 heads in a row (a “big lie”). This result is similar to [27], who find that 20% of subjects “lie to the fullest extent possible” in their die-rolling experiment. Note that the binomial probability of observing 10 heads is 0.1%. This implies that we should expect to observe this outcome no more than once in our sample if all MTurkers reported truthfully. If we set this extreme form of lying aside and disregard MTurkers who reported flipping 10 heads, the average number of reported heads is 5.5, which is still statistically different from five (p < 0.001). We therefore conclude that, yes, on average MTurkers do lie. These lies come in two primary forms: some are plausible (i.e., reporting six) and others are implausible “big” lies that maximize the worker’s earnings (reporting 10).
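A short calculation supports the two benchmarks used in this paragraph. It assumes the binomial model of truthful reporting (fair coin, 10 independent flips) and uses the NoOath sample size of 685 from Section 2.5. Under truthful reporting, fewer than one report of 10 heads would be expected, against 122 observed. Excluding reports of 10 barely moves the truthful benchmark, to about 4.995, so comparing the 5.5 mean to five remains appropriate. The variable names are illustrative.

```python
from math import comb

N_FLIPS, P_HEAD = 10, 0.5
pmf = [comb(N_FLIPS, k) * P_HEAD**k * (1 - P_HEAD) ** (N_FLIPS - k) for k in range(N_FLIPS + 1)]

# Expected number of truthful "10 heads" reports among the 685 NoOath workers.
N_NOATH = 685
expected_tens = N_NOATH * pmf[N_FLIPS]

# Truthful mean once reports of 10 are excluded (the benchmark for the 5.5 figure).
below_ten = range(N_FLIPS)
mean_below_ten = sum(k * pmf[k] for k in below_ten) / sum(pmf[k] for k in below_ten)

print(f"expected truthful reports of 10: {expected_tens:.2f} (observed: 122)")
print(f"truthful mean excluding reports of 10: {mean_below_ten:.3f}")
```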

Fig 2. Heads flipped by flipping time and treatment.

Note. Each figure provides the empirical distribution of heads reported, along with the theoretical truthful distribution. The shaded bars highlight the density of respondents who report having flipped 10 heads. The p-value from Shapiro-Wilk tests of the null hypothesis that each distribution is similar to the normal distribution is <.001 for both the overall distribution and the distribution conditional on the report being lower than 10.

https://doi.org/10.1371/journal.pone.0244958.g002

This conclusion that MTurkers lie is robust across both shirkers (i.e., MTurkers who completed the task in under 30 seconds) and slow workers, who spent enough time on the flipping task to have possibly done it. As shown in the bottom part of Table 2, shirkers reported more heads than slow workers (6.79 vs. 5.98, p < 0.001). Shirkers are also three times more likely to report observing 10 heads (29.1% vs. 9.4%, p < .001, proportion test). Panels (b) and (c) of Fig 2 moreover show that while the modal responses for shirkers were five and 10, the mode for slow workers was six. Still, slow workers are not truthful either: the mean number of heads they reported is 5.98, which is significantly different from five (p < .001), and 9.4% of them reported 10 heads.

3.2 Does an oath reduce shirking and / or lying?

We now examine whether agreeing to a solemn oath causally affects reporting behavior. All but 2 respondents (0.29%) in the Oath treatment agreed to take the oath. Table 2 shows the unconditional results. The average number of heads reported by MTurkers in the Oath treatment was 6.06, which is 4.2% less than the number reported by NoOath MTurkers (6.33, p = 0.008). That the mean still exceeded five (p < 0.001) indicates that the oath is not a panacea for truth-telling. The oath also reduced the share of MTurkers who reported flipping 10 heads in a row by 27% (p = 0.006). In the Oath treatment, 88 MTurkers (12.9%) reported flipping 10 heads in a row, whereas 122 (17.8%) of NoOath MTurkers did so.
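The 27% figure and its p-value can be recomputed from the counts above (88 of 681 in Oath, 122 of 685 in NoOath). The article does not name the exact test used for this comparison. This sketch assumes a standard one-tailed, pooled two-proportion z-test, which gives a p-value of about 0.006. The function name is illustrative.

```python
from math import erfc, sqrt


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """One-tailed pooled z-test of H1: p1 > p2. Returns (z, p_value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                       # common proportion under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 0.5 * erfc(z / sqrt(2))                    # upper-tail standard normal probability
    return z, p_value


# Reports of 10 heads: NoOath (122 of 685) vs. Oath (88 of 681), counts from the text.
z, p = two_proportion_z_test(122, 685, 88, 681)
relative_reduction = 1 - (88 / 681) / (122 / 685)

print(f"z = {z:.2f}, one-tailed p = {p:.4f}, relative reduction = {relative_reduction:.1%}")
```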

The first column of Fig 2 gives the distribution of heads flipped for the Oath and NoOath treatments. The “truthful distribution” is provided for comparison purposes (according to Shapiro-Wilk tests, the equality between the empirical and the theoretical distributions is rejected for all distributions). The distribution in the Oath treatment is significantly different from that in the NoOath treatment at the 10% level (p = 0.10, one-tailed Kolmogorov-Smirnov (KS) test). However, after dropping MTurkers who reported flipping 10 heads, we cannot reject the null hypothesis that the oath had no effect (p = .713). This implies that the oath worked largely by decreasing the number of MTurkers who told big, obvious lies. This is consistent with the idea that telling big lies is more costly than telling small lies [44]. The S1 Appendix, Section C, shows that this change in behavior is unlikely to be due to changes in beliefs about the average behavior of others.

Table 2 also shows that the oath had little effect on the time MTurkers spent answering the coin-flipping question. Further, the oath had no effect on the probability that an MTurker shirks (responds to the coin-flipping task in less than 30 seconds). This is confirmed by the empirical distribution of flipping times in the Oath treatment shown in Fig 1b, which is very similar to the one in the NoOath treatment. To ease the comparison, Fig 1c provides a QQ-plot of the two distributions (which are statistically indistinguishable, p = 0.458, KS test). The relationship between the distribution of flipping times and the share of subjects reporting 10 heads in both treatments confirms the robustness of these conclusions to the choice of the shirking classification rule; see the S1 Appendix, Section B. However, we do observe that the oath induced MTurkers to spend approximately 30 additional seconds filling out the survey (net of the time spent on the coin-flipping task, see Fig 3). This amounts to roughly a 30/214 = 14% increase in survey duration. One speculative interpretation for these contrasting findings is that workers view their responses to survey questions as potentially consequential, since their answers may directly influence any conclusions drawn from the study. In contrast, the coin-flipping task may be viewed as a time-consuming random number generator that can be costlessly avoided by strategically picking a number between zero and ten. While admittedly speculative, this interpretation is echoed by [25], who write that, “For instance, if respondents to an attitudes survey fail to see the importance of the survey, they will not be attentive in their responses and will respond in a careless manner, yielding useless data.” A potentially useful variant of the present study would be to ask subjects to carry out consequential tasks under oath.

Fig 3. Empirical distribution of survey duration, by treatment.

Note. The left-hand side figures report the empirical distribution of survey overall duration, net of flipping time. To ease readability, 13 workers for whom this duration is higher than 1000s were dropped to construct these figures. The figure on the right-hand side displays the QQ plot of the deciles of the net survey duration in the NoOath treatment (on the x-axis) against the Oath one (on the y-axis).

https://doi.org/10.1371/journal.pone.0244958.g003

Shirkers certainly did not carry out the task as requested, whereas slow workers may have carried it out. We now examine the effect of the oath separately for these two groups. Table 2 shows that the oath was similarly effective at reducing the number of heads reported by both shirkers and slow workers. According to Fig 2, the oath reduced the probability that a shirker reports 10 heads, and increased the probability of reporting five heads. The oath had a similar effect on slow workers, but for this group the distribution is less bimodal. However, we cannot reject the null hypothesis that the oath had no effect on the distribution of heads reported for either shirkers or slow workers (p = .720 and p = .370, respectively; KS test of equality between the NoOath and Oath distributions in each group).

In sum, taking a truth-telling oath induced a dramatic decrease in big lies among both shirkers and slow workers. It did not, however, change the share of respondents who shirked.

3.3 Heterogeneous responses to the oath

We now turn to the role played by individual characteristics in both dishonesty and the response to the oath. Given the main lessons drawn in the previous section, we consider several outcomes to document dishonesty:

- the share of subjects who reported having flipped 10 heads in a row;
- the mean number of heads reported among subjects who report a number lower than 10;
- the share of subjects who are classified as shirkers (based on flipping times).

We also include the overall duration of the survey (net of flipping times) in the set of outcomes, so as to assess the robustness of the effect of the oath on this variable.

Table 3 provides the results from Probit and OLS regression models. For each outcome variable, we first look at the heterogeneity in the likelihood of behaving dishonestly, based on regressions in the NoOath treatment. We then move to conditional estimates of the effect of the oath on pooled data from both treatments. The results show that age and gender are the two main sources of heterogeneity in behavior. Being young or male increases the likelihood of both over-reporting the number of heads flipped and shirking, but increases the duration of the survey. This large gender difference confirms previous evidence on lying behavior [e.g., 45, 46]. We also find that US citizens were slightly less likely to tell big lies, and that lying was more widespread among Catholics and high-income individuals. The estimates of the effect of the oath conditional on observed heterogeneity confirm the main conclusions from the raw data. The oath significantly decreased the likelihood of reporting 10 heads and had a small, statistically insignificant effect on the mean number of heads reported in the remaining sub-sample. It also left the likelihood of shirking unchanged and significantly increased the overall duration of the survey.

This observed heterogeneity in dishonesty raises the question of heterogeneous responses to the oath. Coin-flip experiments are not well suited to investigating such heterogeneous responses, since truth-telling can only be observed at the aggregate level, which drastically lowers the statistical power of the analysis. We thus provide exploratory evidence on this question in Table 4, which disaggregates the three dishonesty outcomes by individual characteristics separately in each treatment. The sample size, reported in Table 1, varies across sub-groups because observed heterogeneity was not part of the randomization. In all sub-groups, and in both treatments, we observe a large share of subjects who shirked and/or reported the maximum number of heads. Columns (3) and (6) report the mean number of heads reported among subjects in each sub-group whose report was lower than 10. The mean appears in bold whenever it is consistent with truth-telling behavior, i.e., when the conditional mean is not different from five at the 10% level (the p-values are provided in the S1 Appendix, Section D). The results in the NoOath treatment provide a better understanding of the lying patterns in our sample. First, the mean among subjects who did not lie maximally was generally close to five, suggesting that lies in these sub-populations were typically small. Second, reporting behavior was consistent with truth-telling for a few of these subgroups, in particular Protestants [47] and non-US citizens. This last subgroup is also more likely to report 10 in Table 3: 30% of them do so in the baseline, while the share is only 15% among US citizens. This suggests strong self-selection on lying behavior in this sub-group: individuals who lied did so maximally, while the others reported truthfully. The same applies to Protestants, among whom 30% lie maximally while the rest report truthfully.
Interestingly, while the table confirms large differences in lying behavior according to gender (e.g., 12% of female respondents lie maximally, while 20% of male respondents do so) neither male nor female respondents who do not lie maximally truthfully report.
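A small technical point about the bold entries: once reports of 10 are excluded, the truthful benchmark for the conditional mean is slightly below five. Under truthful reporting the number of heads follows a Binomial(10, 0.5) distribution, and dropping the outcome 10 gives a conditional mean of 5110/1023 ≈ 4.995. The Python sketch below computes this benchmark exactly, together with a simple z-statistic for a sub-group's sub-maximal reports. The helper names are hypothetical, and the z-test is an illustrative choice; the tests actually behind the p-values in the S1 Appendix are not described in this section.

```python
from fractions import Fraction
from math import comb, sqrt

N_FLIPS = 10  # flips per worker, as in the coin-flip task


def truthful_pmf(k: int, n: int = N_FLIPS) -> Fraction:
    """Exact probability of k heads in n fair coin flips: Binomial(n, 0.5)."""
    return Fraction(comb(n, k), 2 ** n)


def truthful_moments_below_max(n: int = N_FLIPS) -> tuple[Fraction, Fraction]:
    """Mean and variance of a truthful report, conditional on the report being < n."""
    mass = sum(truthful_pmf(k, n) for k in range(n))  # P(report < n) = 1023/1024 for n = 10
    mean = sum(k * truthful_pmf(k, n) for k in range(n)) / mass
    second_moment = sum(k * k * truthful_pmf(k, n) for k in range(n)) / mass
    return mean, second_moment - mean ** 2


def z_stat_vs_truth(reports: list[int], n: int = N_FLIPS) -> float:
    """Hypothetical helper: z-statistic of the mean sub-maximal report against the
    truthful conditional mean, using the truthful (null) variance."""
    below = [r for r in reports if r < n]  # drop maximal reports (n heads)
    mean0, var0 = truthful_moments_below_max(n)
    sample_mean = sum(below) / len(below)
    return (sample_mean - float(mean0)) / sqrt(float(var0) / len(below))


benchmark_mean, benchmark_var = truthful_moments_below_max()
print(benchmark_mean, float(benchmark_mean))
```

Because maximal reports are removed before testing, a sub-group can look truthful in this column even when a large share of its members reported 10, which is exactly the self-selection pattern described for non-US citizens and Protestants.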

The right-hand side of Table 4 provides the outcomes observed in the Oath treatment, along with the differences between treatments and their statistical significance (the p-values of all statistical tests are provided in the S1 Appendix, Section D). Columns (4) and (7) both confirm a negligible effect of the oath on the likelihood that a subject shirked in all sub-groups. By contrast, the oath had a dramatic effect on lying behavior through a decrease in the likelihood of lying maximally by reporting 10 heads. This effect was stronger, and statistically significant, in sub-populations in which such big lies were more widespread: young people, males, non-US citizens and low-income people. The oath also slightly reduced the share of Protestants who lied maximally, while preserving the truthful reporting behavior of those who did not. Last, for both Asian and Black respondents who did not lie maximally, the mean number of heads became indistinguishable from truthful reporting under oath.

The bottom part of the table correlates the outcomes in both treatments with self-reported attitudes and church attendance. In the baseline, we observe that people who think it is often justified to cheat, steal, bribe, or fail to pay taxes owed were more likely to report a high number of heads. Similarly, people who trust others were less likely to report a high number of heads. The oath again had a stronger effect on the likelihood of lying maximally, and on the sub-groups in which this share was the highest. We do not find any strong correlation between dishonesty and the frequency of church attendance in the NoOath treatment, which might be due to the heterogeneity of religious affiliations in our sample. The oath did, however, have a significant effect on the likelihood of lying maximally among people with low church attendance, the group in which this share was by far the highest in the baseline. The statistical tests discussed in the text do not account for multiple testing, i.e., the inflation in the probability of a type I error that arises when several independent tests are run on the same data. Table E in the S1 Appendix provides the results of a more conservative approach that adjusts the p-values to account for multiple testing. Under this approach, the effect of the oath on non-US citizens and on people with low church attendance remains significant at the 10% level.
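The text does not name the adjustment used in Table E. As one common example of such a correction, the Holm-Bonferroni step-down procedure controls the family-wise error rate and is never less powerful than a plain Bonferroni correction. The Python sketch below illustrates that procedure; it is not necessarily the authors' method, and the function name and toy p-values are hypothetical.

```python
def holm_adjust(p_values: list[float]) -> list[float]:
    """Holm-Bonferroni step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The (rank+1)-th smallest p-value is multiplied by the number of hypotheses
        # not yet rejected (m - rank); the running maximum keeps adjusted values monotone.
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


# Toy p-values for three hypothetical sub-group tests (not values from the study).
print(holm_adjust([0.01, 0.04, 0.03]))
```

An adjusted value is compared with the nominal level (here, 10%) exactly as an unadjusted p-value would be. Which tests form the "family" to be adjusted is a modelling choice; in this sketch it is simply the list passed in.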

3.4 Ex-post oath

Immediately after answering the coin-flipping question, NoOath MTurkers were asked “Do you swear upon your honor that the number of heads you reported flipping is truthful?” The acceptance rate was 90% (69 of the 685 participants declined to sign). Among those who signed, the average number of heads reported was 6.08, which is significantly greater than five (p < .001) but also significantly lower than the 8.50 heads reported by MTurkers who did not agree to the ex-post oath, a decrease of about 28%. The difference is again mainly driven by “big lies”. For example, the share of subjects who reported flipping 10 heads was 62.3% among respondents who refused to sign the ex-post oath, and 12.8% among the remaining NoOath participants (p < .001, proportion test). Still, we also observe a difference in “small lies”: the average number of heads conditional on the report being lower than 10 is 6.15 in the first group and 5.50 in the second (p = .048). Interestingly, the screening implemented within the NoOath condition by an ex-post oath achieves outcomes similar to those observed in the entire population in the Oath treatment: both the proportion of MTurkers reporting 10 heads and the mean number of heads reported conditional on the report being lower than 10 are very similar, at 12.9% vs 12.9% (p = 1, proportion test) and 5.50 vs 5.48 (p = .756).
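The text labels the comparison of the shares reporting 10 heads a "proportion test" without giving its formula. A standard pooled two-sample z-test is sketched below in Python as one plausible version; that choice is an assumption. The counts 43 of 69 and 79 of 616 are not reported directly: they are reconstructed from the rounded shares (62.3% and 12.8%) and the 69/616 split between refusers and signers, so they are illustrative only.

```python
from math import erfc, sqrt


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Pooled two-sample z-test for equal proportions; returns (z, two-sided p)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common proportion under the null hypothesis
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Counts reconstructed from rounded shares (an assumption): about 43 of the 69
# refusers and about 79 of the 616 signers reported 10 heads.
z, p = two_proportion_z_test(43, 69, 79, 616)
print(f"z = {z:.2f}, p = {p:.2g}")
```

With identical sample shares the statistic is exactly zero and the two-sided p-value is 1, which is the situation in the 12.9% vs 12.9% comparison.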

Table 5 reports the results from Probit regressions of the willingness to sign the ex-post oath on the outcomes of the coin-tossing task, with and without control variables. The results show that MTurkers who reported flipping a large number of heads were less likely to agree to the ex-post oath. This result is statistically significant and robust to conditioning on observed MTurker heterogeneity. We also find that MTurkers who reported flipping 10 heads in a row were less likely to agree to the ex-post oath (see columns (3) and (4)). Interestingly, the effect of the number of heads reported remains negative after conditioning on the indicator for reporting 10 heads as well as on the indicator for shirking. This implies that even MTurkers who lied a little (i.e., did not report flipping 10 heads) were less likely to agree to the ex-post oath than those who gave more honest answers. Also, conditional on the number of heads reported, shirkers were less likely to agree to the ex-post oath (see column (6)). This suggests that MTurkers who did not carry out the coin-flipping task may have viewed their behavior as dishonest, regardless of the answer they gave. Taken together, these results suggest that asking MTurkers to swear on their honor after completing a task may help identify shirkers and liars.

4 Conclusion

We test whether workers on a crowd-working platform lie and shirk, and explore whether a solemn oath to be honest can reduce the prevalence of both. We asked roughly 1,400 MTurk workers to flip a coin 10 times and report the number of heads they flipped. They were paid a bonus of 10 cents for each head reported. In this environment, telling the truth carries a clear and direct cost. Although we cannot tell whether individual workers told the truth, we can observe whether groups of people lied on average by comparing the distribution of reports with the underlying truthful distribution. Using response times, we can also identify shirkers individually: MTurk workers who answered the coin-flipping question too quickly to have actually carried out the task.
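The individual-level shirking measure reduces to a threshold rule on response time. The Python sketch below assumes a cut-off of 20 seconds purely for illustration; the study's actual threshold is not stated in this section, and the record type, field names, and toy records are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical cut-off (seconds): the minimum plausible time to flip a coin
# 10 times and record the outcome. Not the threshold used in the study.
MIN_SECONDS_FOR_TASK = 20.0


@dataclass
class Response:
    worker_id: str
    heads_reported: int
    seconds_on_task: float  # time spent on the coin-flipping question


def is_shirker(r: Response, threshold: float = MIN_SECONDS_FOR_TASK) -> bool:
    """Flag a worker who answered too quickly to have actually flipped the coin."""
    return r.seconds_on_task < threshold


def shirking_rate(responses: list[Response], threshold: float = MIN_SECONDS_FOR_TASK) -> float:
    """Share of responses flagged as shirking."""
    return sum(is_shirker(r, threshold) for r in responses) / len(responses)


# Toy records, invented for illustration only.
sample = [Response("w1", 5, 45.0), Response("w2", 10, 6.5), Response("w3", 7, 12.0)]
print(shirking_rate(sample))
```

Unlike lying, which is detectable only in aggregate, this flag is defined for each worker, but it depends entirely on the chosen cut-off.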

We find that MTurk workers both lie (as measured by the distribution of heads reported flipped) and shirk (as measured by the time spent on the coin-flipping task). Offering respondents the possibility to sign a truth-telling oath reduces lying but leaves shirking unchanged. Whereas workers reported flipping 6.33 heads on average in the baseline survey with no oath, workers under oath reported only 6.05 heads (a statistically significant reduction of 4.2%). While this change is small on average, the effect of the oath is more pronounced for “big” lies. MTurk workers who signed the oath were 27% less likely to report flipping 10 heads in a row, an event that should occur in less than 0.1% of cases under the true distribution. The oath also induced subjects to spend an additional 30 seconds answering the demographic survey (a 13.5% increase), suggesting that the oath caused MTurk workers to answer questions more thoughtfully and carefully. Finally, we find that an ex-post oath (offered after decisions are made) is an efficient screening device: in the sub-population that agrees to sign such an oath, outcomes are behaviorally equivalent to those that arise in the entire population under an ex-ante oath.
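The 0.1% benchmark follows directly from the truthful Binomial(10, 0.5) distribution: the chance of 10 heads in a row is 1/1024, and a truthful worker expects 5 heads, i.e., a 50-cent bonus at 10 cents per head, against a maximum of 100 cents from reporting 10. The short Python check below reproduces these numbers; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

N_FLIPS = 10
BONUS_CENTS_PER_HEAD = 10  # bonus per reported head, as described in the text

# Truthful reports follow Binomial(10, 0.5); exact probabilities as fractions.
pmf = {k: Fraction(comb(N_FLIPS, k), 2 ** N_FLIPS) for k in range(N_FLIPS + 1)}

p_ten = pmf[N_FLIPS]                                    # probability of 10 heads in a row
expected_heads = sum(k * p for k, p in pmf.items())     # truthful mean report
expected_bonus = BONUS_CENTS_PER_HEAD * expected_heads  # cents, under truthful reporting
max_bonus = BONUS_CENTS_PER_HEAD * N_FLIPS              # cents, from reporting 10 heads

print(p_ten, float(p_ten))                       # prints: 1/1024 0.0009765625
print(expected_heads, expected_bonus, max_bonus)  # prints: 5 50 100
```

The gap between the expected 50 cents and the attainable 100 cents is the monetary temptation that the oath has to counteract.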

The oath may have failed to reduce shirking because workers took an oath to honesty rather than an oath to task (i.e., a commitment to actually perform the task as described). Future research should test whether an “oath to task” can reduce shirking. In addition, one reason we observe a large amount of shirking on the coin-flipping task, yet a significant effect of the oath on the time spent on the survey, may be that workers perceive the survey as meaningful or consequential, whereas reporting the number of heads flipped is viewed as less so. Future research could explore this conjecture further.

Acknowledgments

An older version of this paper previously circulated under the title “Lying and shirking under oath”, Economic Science Institute WP n° 278. Our thanks to Marie Claire Villeval for helpful comments and suggestions at a preliminary stage. Thanks to Kyle Borash for technical support. All errors are our own.

References

1. Paolacci G, Chandler J, Ipeirotis PG. Running Experiments on Amazon Mechanical Turk. Judgment and Decision Making. 2010;5(5):411–419.
2. Ross J, Irani L, Silberman MS, Zaldivar A, Tomlinson B. Who Are the Crowdworkers? Shifting Demographics in Mechanical Turk. In: Proceedings of CHI 2010 Extended Abstracts on Human Factors in Computing Systems; 2010. p. 2863–2872.
3. Goodman JK, Cryder CE, Cheema A. Data Collection in a Flat World: The Strengths and Weaknesses of Mechanical Turk Samples. Journal of Behavioral Decision Making. 2013;26(3):213–224.
4. Paolacci G, Chandler J. Inside the Turk: Understanding Mechanical Turk as a Participant Pool. Current Directions in Psychological Science. 2014;23(3):184–188.
5. Walters K, Christakis DA, Wright DR. Are Mechanical Turk Worker Samples Representative of Health Status and Health Behaviors in the U.S.? PLoS ONE. 2018;13(6):e0198835. pmid:29879207
6. Zhou H, Fishbach A. The Pitfall of Experimenting on the Web: How Unattended Selective Attrition Leads to Surprising (yet False) Research Conclusions. Journal of Personality and Social Psychology. 2016;111(4):493. pmid:27295328
7. Horton JJ, Rand DG, Zeckhauser RJ. The Online Laboratory: Conducting Experiments in a Real Labor Market. Experimental Economics. 2011;14(3):399–425.
8. Suri S, Watts DJ. Cooperation and Contagion in Web-Based, Networked Public Goods Experiments. PLoS ONE. 2011;6(3):e16836. pmid:21412431
9. Amir O, Rand DG, Gal YK. Economic Games on the Internet: The Effect of $1 Stakes. PLoS ONE. 2012;7(2):e31461. pmid:22363651
10. Buhrmester M, Kwang T, Gosling SD. Amazon’s Mechanical Turk: A New Source of Inexpensive, Yet High-Quality, Data? Perspectives on Psychological Science. 2011;6(1):3–5. pmid:26162106
11. Crump MJC, McDonnell JV, Gureckis TM. Evaluating Amazon’s Mechanical Turk as a Tool for Experimental Behavioral Research. PLoS ONE. 2013;8(3):e57410. pmid:23516406
12. Shank DB. Using Crowdsourcing Websites for Sociological Research: The Case of Amazon Mechanical Turk. American Sociologist. 2016;47(1):47–55.
13. Farrell AM, Grenier JH, Leiby J. Scoundrels or Stars? Theory and Evidence on the Quality of Workers in Online Labor Markets. The Accounting Review. 2017;92(1):93–114.
14. Kees J, Berry C, Burton S, Sheehan K. An Analysis of Data Quality: Professional Panels, Student Subject Pools, and Amazon’s Mechanical Turk. Journal of Advertising. 2017;46(1):141–155.
15. Berinsky AJ, Huber GA, Lenz GS. Evaluating Online Labor Markets for Experimental Research: Amazon.com’s Mechanical Turk. Political Analysis. 2012;20(3):351–368.
16. Coppock A, Leeper TJ, Mullinix KJ. Generalizability of Heterogeneous Treatment Effect Estimates across Samples. Proceedings of the National Academy of Sciences. 2018;115(49):12441. pmid:30446611
17. Brink WD, Eaton TV, Grenier JH, Reffett A. Deterring Unethical Behavior in Online Labor Markets. Journal of Business Ethics. 2019;156(1):71–88.
18. Hergueux J, Jacquemet N. Social Preferences in the Online Laboratory: A Randomized Experiment. Experimental Economics. 2015;18(2):252–283.
19. Napier BJ, Ferris GR. Distance in Organizations. Human Resource Management Review. 1993;3(4):321–357.
20. Clifford S, Jerit J. Is There a Cost to Convenience? An Experimental Comparison of Data Quality in Laboratory and Online Studies. Journal of Experimental Political Science. 2014;1(2):120–131.
21. Chandler J, Mueller P, Paolacci G. Nonnaïveté among Amazon Mechanical Turk Workers: Consequences and Solutions for Behavioral Researchers. Behavior Research Methods. 2014;46(1):112–130. pmid:23835650
22. Keith MG, Tay L, Harms PD. Systems Perspective of Amazon Mechanical Turk for Organizational Research: Review and Recommendations. Frontiers in Psychology. 2017;8:1359. pmid:28848474
23. Chandler JJ, Paolacci G. Lie for a Dime: When Most Prescreening Responses Are Honest but Most Study Participants Are Impostors. Social Psychological and Personality Science. 2017;8(5):500–508.
24. Smith SM, Roster CA, Golden LL, Albaum GS. A Multi-Group Analysis of Online Survey Respondent Data Quality: Comparing a Regular USA Consumer Panel to MTurk Samples. Journal of Business Research. 2016;69(8):3139–3148.
25. Fleischer A, Mead AD, Huang J. Inattentive Responding in MTurk and Other Online Samples. Industrial and Organizational Psychology. 2015;8(2):196–202.
26. Bucciol A, Piovesan M. Luck or Cheating? A Field Experiment on Honesty with Children. Journal of Economic Psychology. 2011;32(1):73–78.
27. Fischbacher U, Föllmi-Heusi F. Lies in Disguise. An Experimental Study on Cheating. Journal of the European Economic Association. 2013;11(3):525–547.
28. Abeler J, Nosenzo D, Raymond C. Preferences for Truth-Telling. Econometrica. 2019;87(4):1115–1153.
29. Tyler JE. Oaths; Their Origins, Nature, and History. London: J.W. Parker; 1834.
30. Kiesler CA, Sakumura J. A Test of a Model for Commitment. Journal of Personality and Social Psychology. 1966;3(3):349–353. pmid:5906339
31. Joule RV, Beauvois JL. La Soumission Librement Consentie. Paris: Presses Universitaires de France; 1998.
32. Joule RV, Girandola F, Bernard F. How Can People Be Induced to Willingly Change Their Behavior? The Path from Persuasive Communication to Binding Communication. Social and Personality Psychology Compass. 2007;1(1):493–505.
33. Jacquemet N, Joule RV, Luchini S, Shogren JF. Preference Elicitation under Oath. Journal of Environmental Economics and Management. 2013;65(1):110–132.
34. de-Magistris T, Pascucci S. Does “Solemn Oath” Mitigate the Hypothetical Bias in Choice Experiment? A Pilot Study. Economics Letters. 2014;123(2):252–255.
35. Jacquemet N, James A, Luchini S, Shogren JF. Referenda under Oath. Environmental & Resource Economics. 2017;67(3):479–504.
36. Jacquemet N, Luchini S, Shogren JF, Zylbersztejn A. Coordination with Communication under Oath. Experimental Economics. 2017;21(3):627–649.
37. Jacquemet N, Luchini S, Malézieux A, Shogren J. Who’ll Stop Lying under Oath? Experimental Evidence from Tax Evasion Games. European Economic Review. 2020;20:103369.
38. Jacquemet N, Luchini S, Rosaz J, Shogren JF. Truth-Telling under Oath. Management Science. 2018;65(1):426–438.
39. Beck T, Bühren C, Frank B, Khachatryan E. Can Honesty Oaths, Peer Interaction, or Monitoring Mitigate Lying? Journal of Business Ethics. 2020;163(3):467–484.
40. Abeler J, Becker A, Falk A. Representative Evidence on Lying Costs. Journal of Public Economics. 2014;113:96–104.
41. Zizzo D. Experimenter Demand Effects in Economic Experiments. Experimental Economics. 2010;13(1):75–98.
42. Smith VL. Constructivist and Ecological Rationality in Economics. American Economic Review. 2003;93(3):465–508.
43. Hara K, Adams A, Milland K, Savage S, Callison-Burch C, Bigham JP. A Data-Driven Analysis of Workers’ Earnings on Amazon Mechanical Turk. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems; 2018. p. 1–14.
44. Mazar N, Amir O, Ariely D. The Dishonesty of Honest People: A Theory of Self-Concept Maintenance. Journal of Marketing Research. 2008;45(6):633–644.
45. Arbel Y, Bar-El R, Siniver E, Tobol Y. Roll a Die and Tell a Lie—What Affects Honesty? Journal of Economic Behavior & Organization. 2014;107:153–172.
46. Dreber A, Johannesson M. Gender Differences in Deception. Economics Letters. 2008;99(1):197–199.
47. Aimone JA, Ward B, West JE. Dishonest Behavior: Sin Big or Go Home. Economics Letters. 2020;186:108779.