S1 Appendix: Does Suffering Suffice? An Experimental Assessment of Desert Retributivism

Michael S. Moore is among the most prominent normative theorists to argue that retributive justice, understood as the deserved suffering of offenders, justifies punishment. Moore claims that the principle of retributive justice is pervasively supported by our judgments of justice and sufficient to ground punishment. We offer an experimental assessment of these two claims: (1) the pervasiveness claim, according to which people are widely prone to endorse retributive judgments, and (2) the sufficiency claim, according to which no non-retributive principle is necessary for justifying punishment. We test the two claims in a survey and a related survey experiment in which we present participants (N ≈ 900) with a stylized description of a criminal case. Our results seem to invalidate claim (1) and yield mixed evidence concerning claim (2). We conclude that retributive justice theories which advance either of these two claims need to reassess their evidential support.


Balance statistics
The corresponding table provides balance statistics for sex and age across treatment groups.
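A balance check of this kind can be sketched as follows. This is an illustrative outline, not our analysis code; the file and column names ('survey_data.csv', 'treatment', 'sex', 'age') are hypothetical placeholders.

    # Sketch of a randomization balance check (illustrative; all names
    # below are hypothetical placeholders, not our replication files).
    import pandas as pd
    from scipy import stats

    df = pd.read_csv("survey_data.csv")  # hypothetical file name

    # Sex (categorical): chi-square test of independence across groups.
    sex_table = pd.crosstab(df["treatment"], df["sex"])
    chi2, p_sex, dof, _ = stats.chi2_contingency(sex_table)
    print(f"Sex balance: chi2 = {chi2:.2f}, p = {p_sex:.3f}")

    # Age (continuous): one-way ANOVA across treatment groups.
    age_by_group = [g["age"].dropna() for _, g in df.groupby("treatment")]
    f_stat, p_age = stats.f_oneway(*age_by_group)
    print(f"Age balance: F = {f_stat:.2f}, p = {p_age:.3f}")

Non-significant p-values in such a check are consistent with successful randomization of sex and age across treatment groups.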

Crowd-coding of open-ended responses
In total, 119 Amazon Mechanical Turk workers participated in our crowd-sourcing task to classify responses to our open-ended question on the aims of punishment. Table C provides summary statistics on the crowd-sourcing task. We had 881 responses. Each response was to be classified by 4 raters, resulting in a planned total of 3524 assignments; in the end, our data comprised 3466 analyzable assignments. We crowd-sourced the data in 5 batches so that we could assess rating quality and other statistics along the way. As suggested by [1], we tried to pay workers above the federal minimum wage of $7.25 per hour. On average, our workers received a wage of $7.42 per hour, with the effective wage varying with individual speed. Workers accepted for our task needed to be located in the U.S., have a HIT Approval Rate for all Requesters' HITs greater than 97%, have more than 1000 approved HITs, and hold the 'Masters' qualification. Masters are an elite group of Workers who have demonstrated accuracy on specific types of HITs on the Mechanical Turk marketplace. We added the Masters requirement after Batch 1 and noticed a considerable increase in response quality. The crowd-sourcing task is depicted in Figure B. We provided raters with a set of possible aims of punishment and asked them to classify each response according to whether certain aims were mentioned or implied by the respondent's answer. For this task we did not randomize the order of the categories, since we wanted raters to get used to the classification interface.
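To illustrate the wage arithmetic only (the payment and timing figures below are invented placeholders, not our actual rates):

    # Illustrative only: implied hourly wage from assignment durations.
    # Both numbers below are hypothetical placeholders.
    pay_per_assignment = 0.25         # USD per completed assignment (assumed)
    avg_minutes_per_assignment = 2.0  # average completion time (assumed)

    hourly_wage = pay_per_assignment * (60 / avg_minutes_per_assignment)
    print(f"Implied hourly wage: ${hourly_wage:.2f}")  # $7.50 in this example
    print("Above federal minimum of $7.25:", hourly_wage > 7.25)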
Since not all raters coded all responses, we use Krippendorff's alpha as a measure of interrater reliability [2]. We calculated alpha for each of the 7 categories into which raters could sort a response. The results are depicted in Table D. Krippendorff's alpha ranges from 0.33 to 0.74; that is, there are categories for which it is relatively satisfying, e.g., rehabilitation, and categories for which it is less satisfying, e.g., vengeance. For the main analysis in the paper we chose a conservative strategy: we only coded a response as belonging to a category such as "suffering" when at least 3 out of 4 raters agreed that it belonged to that particular category. This is a rather strict cutoff and could mean that we underestimate the prevalence of certain aims in the responses. However, we assume that any such underestimation is relatively constant across aims and hence should not affect our conclusions about Hypothesis 1. Crowd-coding has been both hailed as a useful strategy and viewed critically [3][4][5][6]. Because Krippendorff's alpha was not higher for certain categories, we carried out additional analyses to check whether our results remain robust to the exclusion of certain workers. Some workers may take the task less seriously than others, which leads to measurement error. Below we excluded the codings of workers who finished their assignments in an average time of less than 0.3 minutes or more than 5 minutes, or who coded only 1 response. Extremely low average times may reflect superficial coding, while very long times may indicate that workers worked on several assignments in parallel and only finished them once the time ran out. Furthermore, we assume that coding quality improves once workers get used to the coding scheme, which speaks against relying on workers who coded only a single response. The results for this rater subsample are depicted in Tables E and F: the share of responses classified as mentioning the aim of suffering is even lower than before the exclusion of these raters.
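A minimal sketch of this reliability pipeline, assuming a long-format assignments table: the column names ('rater', 'response_id', one 0/1 column per aim) and the category labels are hypothetical stand-ins, while krippendorff.alpha is the standard call from the krippendorff package.

    # Sketch: Krippendorff's alpha per category and the 3-of-4 coding rule.
    # File, column names, and category labels are hypothetical placeholders.
    import pandas as pd
    import krippendorff  # pip install krippendorff

    df = pd.read_csv("assignments.csv")  # long format: one row per rating
    categories = ["suffering", "vengeance", "desert", "deterrence",
                  "rehabilitation", "incapacitation", "other"]  # assumed labels

    for cat in categories:
        # Pivot to a raters-by-responses matrix; cells a rater did not
        # code become NaN, which Krippendorff's alpha treats as missing.
        matrix = df.pivot_table(index="rater", columns="response_id",
                                values=cat, aggfunc="first")
        a = krippendorff.alpha(reliability_data=matrix.to_numpy(),
                               level_of_measurement="nominal")
        print(f"{cat}: alpha = {a:.2f}")

    # Conservative aggregation: a category counts as mentioned only if
    # at least 3 of the 4 raters assigned it to the response.
    counts = df.groupby("response_id")[categories].sum()
    coded = (counts >= 3).astype(int)
    print(coded.mean())  # share of responses per aim under the strict cutoff

The same pipeline can be rerun after dropping raters with implausible average completion times to produce the robustness check reported in Tables E and F.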
Finally, while Table F depicts the prevalence of certain aims across all respondents, Table G depicts the prevalence of the aims of punishment split across treatment groups. Since we collected the data to test H1 after our survey experiment, one might worry that the considerations elicited through the open-ended question were affected by the experimental treatments. Table G allows us to explore whether participants' open-ended answers appear to have been influenced by our treatments. While there are some differences, these do not seem strong enough to be problematic for a test of Hypothesis 1.
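Continuing the sketch above (with the same caveat that all names are hypothetical), treatment dependence of the coded aims can be probed with a chi-square test per category:

    # Sketch: does mentioning an aim depend on the treatment group?
    # Builds on 'df', 'categories', and 'coded' from the previous sketch;
    # the 'treatment' column is again a hypothetical placeholder.
    import pandas as pd
    from scipy.stats import chi2_contingency

    treatment = df.groupby("response_id")["treatment"].first()
    data = coded.join(treatment)

    for cat in categories:
        table = pd.crosstab(data["treatment"], data[cat])
        chi2, p, dof, _ = chi2_contingency(table)
        print(f"{cat}: chi2 = {chi2:.2f}, p = {p:.3f}")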

Analysis of variance
In addition to the comparisons and models estimated in our 'Results' section, we carried out classical ANOVA analyses. Figure C displays the averages across all treatment groups. Figure D displays the averages in the treatment groups when the sample is split according to the values of our two treatment variables, Suffering and Moral Change, independently of the respective other variable; the raw data points are spread out using jitter. One-way ANOVA tests yield significant p-values for the group means of both the Suffering treatment (P-value = 0.033) and the Moral Change treatment (P-value = 2.15e-09), indicating that some of the group means differ. While Moral Change has only two subsamples (groups), the omnibus test does not tell us which of the three Suffering subsamples display statistically significant differences. A one-way ANOVA test splitting the sample into the 6 treatment groups yields the same result. In a next step we perform multiple pairwise comparisons, computing Tukey Honest Significant Differences [7], to determine whether the mean difference between specific pairs of groups is statistically significant. We find a highly statistically significant difference when comparing the Moral Change treatments ("no" vs. "yes"); the estimated difference is 1.2 (P-value = 0.00). For Suffering there is a significant difference of -0.63 when we compare the "unhappy" to the "happy" category (P-value = 0.02), i.e., the two extreme categories on this three-point scale. The differences between neutral and unhappy and between happy and neutral are not statistically significant. ANOVA tests assume normally distributed data and homogeneous variance across groups. We checked the homogeneity-of-variance assumption using Levene's test [8]. The test indicates a violation for the groups of Moral Change but not for the groups of Suffering. For this reason we also computed a non-parametric alternative to the one-way ANOVA test, the Kruskal-Wallis rank sum test [9]. The results from the rank sum test likewise indicate significant differences between the treatment groups for both treatment variables.
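This battery of tests can be sketched compactly in Python. The outcome column 'punishment' and the grouping column 'suffering' are assumed stand-ins for our variables, not our replication code, and stats.tukey_hsd requires a recent SciPy release.

    # Sketch: ANOVA, Tukey HSD, Levene, and Kruskal-Wallis for the
    # Suffering treatment (illustrative; names are hypothetical).
    import pandas as pd
    from scipy import stats

    df = pd.read_csv("experiment_data.csv")  # hypothetical file name
    groups = [g["punishment"].dropna().to_numpy()
              for _, g in df.groupby("suffering")]  # happy/neutral/unhappy

    # Omnibus one-way ANOVA: are any group means different?
    f_stat, p = stats.f_oneway(*groups)
    print(f"ANOVA: F = {f_stat:.2f}, p = {p:.4f}")

    # Tukey HSD: which specific pairs of groups differ?
    tukey = stats.tukey_hsd(*groups)
    print(tukey)  # pairwise mean differences with adjusted p-values

    # Levene's test for the homogeneity-of-variance assumption.
    w_stat, p_lev = stats.levene(*groups)
    print(f"Levene: W = {w_stat:.2f}, p = {p_lev:.4f}")

    # Non-parametric fallback if variances are heterogeneous.
    h_stat, p_kw = stats.kruskal(*groups)
    print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kw:.4f}")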

Contrasting open-ended, ranking and classic retributivism scale
Figure E visualizes the results of the ranking question that provides respondents with a pre-defined choice set of aims of punishment. Here, however, we visualize those rankings for the subsets of participants who picked particular values on the classic retributivism scale, either low values (0-3) or high values (7-10). Again we can observe that respondents who pick high values on the retributivism scale more often rank the aim of desert first; however, far from everyone does. For instance, across both low and high values of the classic retributivism scale, a large share of people rank the aim of deterrence first. In other words, when contrasted with the classic retributivism scale, both our open-ended measure and our ranking measure reveal that while there is overlap, there is also considerable variation behind the same value on this scale.

Rankings
Share of aims mentioned for each rank