Do Monetary Incentives Undermine Performance on Intrinsically Enjoyable Tasks? A Field Test

Constança Esteves-Sorenson    Robert Broce

August 2017

Abstract

Economists have long been intrigued by an influential literature in psychology positing that monetary pay lowers performance on enjoyable tasks by crowding out agents' intrinsic interest in them. Typical experiments in this literature, however, do not report comprehensive performance measures, in particular those that might yield conflicting evidence on crowding out, and may suffer from confounds. Building on psychology's canonical two-session test for crowding out, wherein agents receive pay for an interesting activity in session one that is withdrawn unexpectedly in session two, we run a field experiment testing whether paying agents for an enjoyable task harms their performance, measured by a comprehensive set of measures, and, if so, whether unmet pay expectations might also contribute to this decline. Results on output, productivity and quits are most consistent with a standard model, instead of a crowding-out one. Further, though more speculative, evidence on quality suggests that unmet pay expectations aggravate performance declines.

JEL Codes: C92, D03, J33, M52, M55.
Keywords: incentives, intrinsic enjoyment, crowding out, field experiments, expectations

We thank Rosario Macera for her work in the initial stages of this project. We thank Kaitlyn Croteau, Cathy Le, Tiffany Lin, Karen O'Brien and Ignacio Osorio for excellent research assistance. For their comments, we thank Jason Abaluck, Daylian Cain, Jason Dana, Florian Ederer, Florian Englmaier, Kelsey Jack, Lisa Kahn, Botond Kőszegi, Ian Larkin, Ulrike Malmendier, Stephan Meier, Pedro Rey-Biel, Olav Sorenson, Amy Wrzesniewski and seminar participants at UC Berkeley, Pontificia Universidad Católica de Chile, Universidad de Chile, University of Warwick, Yale University and the 36th Annual NBER Summer Institute in Personnel Economics. This research was partly funded by the Russell Sage Foundation and the Whitebox Behavioral Grants at Yale University. constanca.esteves-sorenson@yale.edu, Yale University. brocer1@southernct.edu, Southern Connecticut State University.

1 Introduction

Boosting worker productivity is a central concern in labor and personnel economics. Although incentive pay is viewed as a key tool for achieving this goal (e.g. Gibbons (1998), Prendergast (1999), Lazear (2000)), an influential strand of research in psychology, pioneered by Deci (1971) and advanced by Deci, Koestner, and Ryan (1999), has argued that pay undermines performance in one important case: when an agent enjoys a task (i.e., engages in it "due to the enjoyment he experiences in the activity" (Deci (1971), page 105)). In this case, incentive pay may feel controlling to the agent (e.g., Deci, Koestner, and Ryan (1999)) or create the misperception that the task is not enjoyable (e.g., Lepper, Greene, and Nisbett (1973)), decreasing the agent's interest in it and thus lowering performance. This idea, that pay erodes performance when agents enjoy tasks, has received thousands of citations in psychology and, more recently, in economics, and has been part of the standard educational curriculum in human resource and organizational behavior classes at business schools for more than 15 years (e.g., Baron and Kreps (1999), Lazear and Gibbs (2014)).[1]

Evidence fueling the prevalence of this idea has mainly arisen from the canonical two-period test in psychology for the crowding out of enjoyment: subjects are asked to perform an enjoyable task, such as completing interesting puzzles, and are subsequently randomized to a treatment and a control group: the treatment group is offered an unexpected payment in the first period (e.g., a piece rate per puzzle completed), which is withdrawn unexpectedly in the second period, and the treatment group's performance is compared with that of the control, which is not paid in either period.[2] Whereas a standard economics model predicts that the treatment group's effort in the second period, when pay is withdrawn, should be similar to that of the unpaid control group, the crowding out literature has instead documented the puzzling finding that the treatment group's outcome drops below that of the unpaid control. This literature argues that this is due to the crowding out of enjoyment: the first-period payment undermines agents' perceptions of the task's enjoyability, leading to a worse second-period outcome than would have occurred in the absence of a first-period reward.

An issue, however, with tests for crowding out of enjoyment is that different tests report different outcomes, and the choice of outcomes determines whether one finds evidence for or against this idea. A typical example is the seminal Deci (1971) paper.

[1] A book with this idea (Deci and Ryan (1985)) has over 20,000 citations; the psychology meta-analyses and 92 articles we reviewed had over 20,000; economics citations exceed 6,000 (source: Google Scholar).
[2] See Deci and Ryan (1985) for an extensive review of all studies using this two-period design.

He shows crowding out by describing a first test wherein pay decreased the amount of time subjects spent solving puzzles, but he does not report whether it reduced the number of puzzles solved (output) or the rate of solving puzzles (productivity); and by describing a second test wherein pay slowed the rate of writing news headlines (reduced productivity), but he does not report time spent on the task (the first test's measure) or the number of headlines (output). Since productivity is the ratio of output over time, could the reduced productivity in the second test result from an increase in time spent on the task that left output unchanged? If so, though the productivity evidence was consistent with crowding out, the output and time evidence would not be; and, further, the evidence on time spent on the task would conflict with that of the first study.

The issue of lack of comprehensive reporting of performance outcomes and, in particular, of outcomes that might yield contradictory evidence on crowding out, such as output, productivity and time spent on the task, is not limited to Deci (1971): it occurs in the 96 studies we reviewed in this literature in psychology and in economics.[3] As different tests report different metrics, and because the choice of metric affects whether one finds support for crowding out or against it (in favor of a standard model), there have been conflicting results and meta-analyses on the existence of crowding out (e.g. Cameron, Banko, and Pierce (2001)) and demands for more tests.[4] Further, evidence for crowding out could be confounded with other phenomena: the underperformance versus the control after the withdrawal of the reward might, for example, also occur because the unexpected first-period payment creates the expectation of a second-period payment that, when unfulfilled, depresses effort as agents retaliate or lose morale (e.g., Bewley (1999)).

We contribute to this research by implementing a field experiment with two goals: first, to test whether monetary incentives undermine performance on an enjoyable task, by studying and reporting not selected performance metrics but rather several of economic interest, such as output, productivity, and quits; and second, to investigate the extent to which unmet pay expectations may contribute to this undermining, in case we observe it.

[3] We describe these 96 studies in more detail in Appendix D, which reviews 92 studies, and footnote 6, which reviews an additional four studies.
[4] For example, the psychology meta-analysis of Deci, Koestner, and Ryan (1999) found that the bulk of the evidence supported crowding out whereas that of Cameron, Banko, and Pierce (2001) found it did not. Economists have demanded more evidence. Prendergast (1999, page 18) claimed that while this idea holds some intuitive appeal, it should be noted that there is little conclusive empirical evidence (particularly in workplace settings) of these influences. Gibbons (1998, page 130) argued that "field experiments [...] would be especially useful"; Rebitzer and Taylor (2011, page 765): "Although [...] the evidence is not yet conclusive, we are intrigued by the notion that extrinsic rewards can undermine intrinsic motives." For a review see, for example, Kamenica (2012).
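To make the reporting problem concrete, the short sketch below works through hypothetical numbers (Deci (1971) reports neither output nor time for the headline task, so these figures are purely illustrative): measured productivity, defined as minutes per headline, can worsen even when output is unchanged and time on the task rises.

```python
# Hypothetical numbers, purely for illustration (Deci (1971) does not report
# output or time for the headline task).
baseline_headlines, baseline_minutes = 10, 30   # control: 3.0 minutes per headline
treated_headlines, treated_minutes = 10, 40     # treatment: 4.0 minutes per headline

def minutes_per_unit(output, minutes):
    """Productivity as the crowding-out literature measures it: time over output."""
    return minutes / output

print(minutes_per_unit(baseline_headlines, baseline_minutes))  # 3.0
print(minutes_per_unit(treated_headlines, treated_minutes))    # 4.0
# Productivity "worsens" (more minutes per headline) even though output is identical
# and time on the task actually rose -- exactly the ambiguity described in the text.
```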

We find that output, productivity, and quits results are most consistent with a standard model: pay boosts performance and its withdrawal does not lead to the pathological underperformance versus the control. Additional, though more speculative, results on quality suggest that unmet pay expectations may worsen performance; this evidence could also be consistent with crowding out, but under stronger assumptions.

We started by running a field experiment replicating the two-period structure of the canonical test for the crowding out of enjoyment using a simple market research task. Students on two campuses volunteered to blind-taste and rate cookies alone in a room for an anonymous principal, for two sessions, each one week apart, for no monetary pay. They were offered only additional cookies as a thank-you for participating. We chose blind tasting because we relied on self-selection into the activity, in the absence of monetary pay, to reveal liking for it and judged blind tasting to be appealing enough to induce this selection. Ensuring that subjects liked at least one component of the activity, the one targeted for pay, is important since proponents of crowding out have argued that this is a necessary condition for pay to erode performance (e.g., Deci, Koestner, and Ryan (1999)). For example, subjects may enjoy completing a puzzle (first component) but they might like or dislike searching for pieces (second component). However, it is key that they at least enjoy the first component, completing puzzles, so that paying them a piece rate per puzzle completed reduces performance.[5]

Blind-tasting cookies has two main components: tasting a cookie (e.g., inspecting it and taking a bite) and evaluating it (completing an evaluation sheet rating it on several dimensions). Although tasters may select into the blind tasting because, in addition to liking tasting, they also like evaluating, selection revealed whether tasters at least enjoyed tasting: if they disliked tasting cookies (besides disliking evaluating) it is unlikely they would volunteer to taste and evaluate for no pay. Second, blind tasting is a task that mainly benefits a principal and an agent, and not a third party, as is typical in employment. It also offers key performance metrics, such as output (number of cookies tasted and evaluated); productivity (minutes per cookie tasted and evaluated); and quits (percentage of tasters that quit after being paid, leveraging our two-session design). It also offers a further metric, time spent on the task, the most used in this literature. We report it, though we view it as a more fragile indicator of performance if not paired with output.

[5] As discussed in Section 2.2, the rationale is that when the task that is being rewarded (e.g., completing a puzzle) is not enjoyable then pay cannot undermine performance because there is no enjoyment to erode.

For example, if pay reduces agents' time on the task but does not reduce output (thus boosting productivity), then it is unclear that pay harms performance.

After subjects self-selected into the blind tasting, they were randomly assigned to three groups. Those in the Control blind-tasted in both sessions and got only thank-you cookies at the end, as promised. Those in the Unanticipated group were surprised with a $0.75 piece rate per cookie rated in the first session (but got no information about a second-session payment); at the beginning of the second session they were informed that the piece rate would not be granted. This treatment mirrors that in the canonical psychology test, where underperformance vis-à-vis the control group after the payment is withdrawn has been ascribed to pay eroding interest in the task. As noted, however, this undersupply could also be due to unmet expectations for a second-period reward. To test whether an environment with no surprises would yield non-standard behavior, we ran a no-surprises Anticipated treatment, in which the introduction and the withdrawal of pay were disclosed in advance: after recruiting, but one week before the first session, tasters were informed that they would receive a $0.75 piece rate in the first session but not in the second. Further, as the increased effort in the first period could lead to fatigue or satiation in the second period, which has been a potential confound with crowding out in prior tests, we improved the canonical design by separating the two sessions by one week.

All results, except for one, on output, productivity, and quits, across both treatments and both sessions are consistent with a standard model. The piece rate in the first session boosted output by at least 60% and productivity by 28% in either treatment, consistent with a standard model. After the withdrawal of pay, in the second session, output was not lower than that of the Control in either treatment, even taking into account quits, also consistent with a standard model. Quit rates after pay in the first session differed per treatment and are also consistent with a standard model: they were 4% lower than the Control's in the Unanticipated condition (congruent with subjects expecting to be paid in session two) and 14% higher than the Control's in the Anticipated condition (congruent with subjects not expecting to be paid in session two). Productivity in the Anticipated condition during the second session was not lower than the Control's, consistent with a standard model.

The result that is inconsistent with a standard model is that productivity in the Unanticipated condition exceeded that of the Control after the surprise withdrawal of pay in the second session. This result is also inconsistent with crowding out: for example, Deci (1971) argues that crowding out led to lower productivity than the control's after the unexpected removal of the piece rate.

But since he did not report time spent on the task or output, it is unclear what caused the lower productivity: could subjects have spent more time on the task while producing the same output (both incongruent with crowding out)? In our case, the puzzling excess productivity was due to subjects having slightly higher output than the Control (inconsistent with crowding out), but spending slightly less time on the task (qualitatively consistent with crowding out, though not statistically significant). We conjectured that this result could be due to subjects, displeased by the withdrawal of pay, tasting their cookies and departing as quickly as possible, leaving in their wake sloppier ratings. Our more speculative evidence on the quality of the evaluations, measured by the randomness of the ratings, suggests this might have occurred, as these subjects rated cookies more at random than those in the Control. Thus, though the surprise withdrawal of pay did not lead to a deficit of output and productivity, it appears to have led to a deficit in quality, though the quality evidence is more tentative. A deficit in quality could also be consistent with crowding out, but only under less plausible assumptions, such as that rating carefully was the only enjoyable component of the task and that pay ruined it.

This paper presents, to our knowledge, the first test in economics or in psychology using several performance metrics to explore, as transparently as possible, whether paying agents to perform an enjoyable task that primarily benefits them and a principal (as is typical in employment) undermines performance. We focus on output, productivity, and quits but also report time spent on the task and quality. In particular, our review of 96 studies in psychology and in economics (see Appendix D, which reviews 92 psychology papers, their experiments, tasks, sample sizes, outcomes, etc., and footnote 6) documents that no prior test has jointly reported output, productivity and time spent on the task, which is important as these interrelated measures may yield conflicting evidence: for example, a decline in time spent on the task, while consistent with crowding out, increases productivity (inconsistent with crowding out) if output does not decrease (also inconsistent with crowding out).[6]

[6] The few tests in economics investigating the potential role of crowding out on performance in tasks that primarily benefit agents and a principal have mostly used single performance measures and have found conflicting results. Gneezy and Rustichini (2000) found that students paid a fixed monetary fee for a one-time 45-minute laboratory session answering IQ questions had fewer correct answers (the single outcome measure) if paid an additional, but low, piece rate (it did not investigate other performance metrics; or explore how pay affects subsequent performance; or whether the task was enjoyable). In contrast, Hossain and Li (2014) found that paying subjects for a task framed as regular data-entry work in one session did not reduce subjects' willingness to work in a second session nor did it reduce output or quality in the second session (it did not report time spent on the task or productivity, or explore whether data-entry work was enjoyable). Huffman and Bognanno (2014) found that subjects paid a flat fee and later given a piece rate for signing up people for a database reduced sign-ups (output, their performance measure) after the withdrawal of the reward. They viewed the pattern of the drop in output as not fully consistent with crowding out. In psychology, a recent study by Goswami and Urminsky (2017) showed five experiments using task choice (e.g., choice of a math task or of viewing videos) as a main outcome. It concluded that the drop in performance post-withdrawal of the reward was not due to crowding out but subjects' desire to take a break. Output, productivity and time spent on the task were not jointly reported for each experiment. A meta-analytic aggregation of their disparate experiments and conditions briefly mentioned output and productivity, among other measures, but not time spent on the task. For theory, in economics, of the effects of crowding out on performance on tasks that benefit mainly principals and agents, see, for example, Bénabou and Tirole (2003).

Importantly, our test differs from others in economics, which have mainly focused on testing whether pay dampens outcomes on tasks that mainly benefit a third party, such as donating blood or giving to a charity (named prosocial tasks). Pay may harm outcomes in these tasks (e.g., lower donations) by spoiling agents' signal of being prosocial (Bénabou and Tirole (2006)).[7]

2 Field Experiment Design

To motivate the field experiment design and the performance measures, we start by describing the leading evidence in psychology for the crowding out of enjoyment.

2.1 Prior Tests of Crowding Out of Intrinsic Enjoyment

In the most cited paper in this literature, Deci (1971) investigated whether pay dampened performance in enjoyable tasks with two tests. In the first, students were asked to solve puzzles in three consecutive sessions to fulfill a class requirement. This task was selected as it was presumed that students would enjoy it and indeed they rated it as highly enjoyable on a 9-point scale. Subjects were randomly assigned to a control and a treatment group of 12 subjects each. In each session they completed puzzles in front of an experimenter for about one hour. Those in the treatment were surprised with a $1 piece rate per puzzle completed in the second session (the reward session). At the beginning of the third session, however, they were informed that the piece rate would be withdrawn due to lack of funds (the non-reward session). Those in the control were never paid.

[7] The evidence that pay undermines outcomes on prosocial tasks is also mixed. Ariely, Bracha, and Meier (2009) found that incentive pay induced contributions to a charity only when these incentives could not be publicly observed. And though Mellström and Johannesson (2008) found that pay reduced the supply of blood among women (but not among men), Lacetera, Macis, and Slonim (2012) found no blood supply reductions. Ashraf, Bandiera, and Jack (2014) also found that financial incentives did not undermine performance among volunteers for a task with a prosocial component: the sale of female condoms for HIV prevention. Chetty, Saez, and Sandór (2014) also found that incentives did not undermine prosocial behavior in the provision of referee reports. For a review of the effect of incentives in non-employment settings, such as education and contributions to public goods, see Gneezy, Meier, and Rey-Biel (2011).

Enjoyment was measured as the time subjects spent trying to complete puzzles during an eight-minute window in which the experimenter left the room (the free-choice window). Deci found that subjects given the unexpected piece rate spent more time than control subjects trying to complete puzzles during the free-choice window. After this payment was withdrawn, however, subjects spent less time than those in the control trying to solve puzzles during the free-choice window, a difference significant at the 10% level in a one-tailed test. Output (number of puzzles solved) and productivity (minutes spent solving each puzzle) were not reported, nor were subjects given the option of forgoing solving puzzles (quitting) after being paid.

The influential Deci (1971) paper also describes a field experiment, one of the few in this literature, with additional evidence of a shortfall in performance after the withdrawal of pay: lower productivity and increased quits. Eight students staffing a college newspaper were randomly assigned to a control and a treatment group. Four subjects in the treatment were offered $0.50 per headline written over three weeks. At the end of this period they were informed that funds had been exhausted and they would no longer be paid. The four subjects in the control were never paid. Enjoyment was measured as the number of minutes spent per headline (productivity): the faster subjects wrote a headline (i.e., the more productive they were), the more they liked the activity. Deci (1971) uses the lower productivity in the treatment (an increase in the minutes to write a headline in the three weeks after pay was withdrawn) as evidence for crowding out (noting that quits reduced the control to two subjects).[8] Time spent on the task (writing headlines) and output (number of headlines) were not reported. He also uses higher quits in the treatment after the withdrawal of pay as additional evidence for crowding out, although this is also consistent with a standard income effect. This test and Boal and Cummings (1981) were the only two field tests with adults in our survey of 92 studies on crowding out in psychology (see Appendix D).

In both tests, treatment group outcomes below those of the control during the non-reward period have been viewed as evidence of the perverse effect of pay on enjoyment. The two leading explanations for this phenomenon have been cognitive evaluation theory and the overjustification hypothesis. The first proposes that rewards are construed as unpleasant controllers of behavior, undermining individuals' intrinsic motivation, which refers to "doing something because it is inherently interesting or enjoyable" (Ryan and Deci (2000), page 55).

[8] Although productivity is usually measured as output over inputs (such as time), the crowding-out literature usually studies productivity by measuring time over output (e.g., minutes to write a news headline), as it is often easier to interpret. We will follow the same procedure in our analysis.

The second postulates that a person paid to perform an interesting activity may infer that his actions were "basically motivated by the external contingencies [...], rather than by an intrinsic interest in the activity itself" (Lepper, Greene, and Nisbett (1973), page 130). This latter paper, the second most cited in this literature, uses this hypothesis to explain why nursery school children who were surprised with a prize for drawing spent less time drawing during the subsequent non-reward period than those children who were never rewarded.

Tests for crowding out based on these two theories require a reward period followed by a non-reward one. Tests for overjustification need an initial period in which subjects are paid, leading them to misattribute their interest in the task to the reward, followed by a non-reward period in which the results of the misattribution become apparent. Tests of cognitive evaluation theory also need two periods, as argued by Deci and Ryan ((1985), page 184), as in the first period there is a trade-off between the displeasing nature of the reward and its incentive effect, and thus it is difficult to disentangle which dominates. For this reason, except for the three-period setup in Deci (1971), crowding-out studies have implemented a two-period, between-subjects test in which the control is unpaid in both periods and the treatment receives an unexpected reward in period one that is withdrawn in period two. This two-period design, though appropriate for testing the two theories above, might, however, introduce the confounds of satiation (declining marginal utility) or fatigue (non-separable cost of effort) or of unmet wage expectations: pay in period one may cause the expectation of pay in period two, which, when unmet, may also lead to underperformance.[9]

Tests of crowding out of enjoyment must also ensure that subjects like the task that is being incentivized. If not, then pay will not undermine performance because "there is little or no intrinsic motivation to crowd out" (Deci, Koestner, and Ryan (1999), page 633). Failures to replicate crowding out have led to debate on whether subjects enjoyed the tasks in the first place (Deci, Koestner, and Ryan (1999) and Cameron, Banko, and Pierce (2001)). Typical measures used to assess interest have relied on reasonable, but arbitrary, cut-offs on scales rating enjoyment or on the time spent on the task prior to the start of the experiment. There is thus debate on whether those who rated their enjoyment as a 5 on a 9-point scale or spent four minutes on the task find the task enjoyable while those who rated it as a 4 or spent three minutes on it do not. Our test therefore relies on a less subjective criterion to assess inherent interest in the rewarded task: self-selection into it in the absence of monetary pay.

[9] Early on, Calder and Staw (1975) proposed that shortfalls in performance following the withdrawal of a reward could partially be due to subjects being angry at the experimenter. This idea was dismissed by the literature (e.g., Deci (1985)) with the reasoning that because subjects had begun the activity without expecting to be paid, the introduction of pay followed by its unexpected removal should not upset them.

2.2 The Field Experiment

The field experiment comprised one leg on campus A in January 2012 and three additional legs on campuses A and B in April, June, and July. We used two campuses to gather a larger sample and to assess whether the results held across two separate environments.

(1) Two-session activity and recruitment of subjects who find the task enjoyable. Our experiment builds on blind tasting, a common task in market research for many goods (e.g., wines, sodas, cheese, cookies). This task mainly benefits the agent (who tastes the goods) and the principal (who receives the evaluations). Thus this is not a prosocial task, like donating blood or money, where the main motivator is the benefit to others.

The blind tasting was advertised on two campuses in Connecticut as a two-session activity through flyers and electronic mailing lists. Interested students contacted a research assistant who described the task as follows: "You need to taste and evaluate cookies in two sessions, exactly one week apart. You will taste alone, filling out an evaluation form rating each cookie's flavor, aroma and other characteristics. You will not be paid for the task and can taste as many or as few different cookies as you like for up to three hours due to room availability constraints. At the end of the second session, you will receive a luxury Godiva cookie tin as a thank-you gift." As is common in blind tasting, the principal who commissioned the study remained anonymous so as not to bias tasters' ratings. Tasters tasted cookies alone, for the two sessions, and were unaware that they were participating in a study on incentives.

Blind tasting thus has two main components: (i) tasting cookies (e.g., inspecting them and taking a bite) and (ii) evaluating them (filling in a form rating each cookie over several dimensions). Tasters may like tasting but dislike evaluating. Or they might like both tasting and evaluating: research suggests that individuals who like a topic exert more effort in evaluations (e.g., answer more questions, write longer answers to open-ended questions) due to a halo effect (Groves, Presser, and Dipko (2004) and Holland and Christian (2009)). For the purposes of the field experiment, it was only necessary that tasters liked to taste cookies as they would later be incentivized to do so.

We relied on self-selection into the blind tasting for no monetary pay (instead of on a more arbitrary measure, such as task enjoyment ratings) to assess whether participants enjoyed the task.

We offered no pay, such as a show-up fee or gift certificate, during recruiting because individuals who found blind tasting distasteful might nonetheless participate to receive this pay. However, because in most tastings participants receive a thank-you gift (e.g., a gift certificate), we offered volunteers a thank-you gift as well, but one that would not undermine selection into the study: thank-you cookies. We assumed that the more tasters liked tasting cookies the more they would enjoy the thank-you cookies. An implication of this monotonic relationship is that if agents disliked tasting cookies, they would also dislike the thank-you cookies and thus not self-select into the blind tasting because of them. Further, the thank-you cookies were perishable and hard to resell, reducing the chance that tasters disliked the blind tasting but self-selected into it to resell the thank-you cookies.[10],[11] Thus, those who selected into the blind tasting revealed that they liked tasting enough so that the intrinsic utility of tasting the study's cookies and of the thank-you cookies outweighed the cost of evaluating (if they disliked evaluating) and the opportunity cost of other uses of their time. This intuition is also outlined in the model in Appendix E.

(2) Task implementation and how it addresses fatigue, satiation, and other confounds. Tasters were required to fill out an evaluation form after tasting each cookie, as is typical in blind tasting. They rated cookies on a scale of 1 (Excellent) through 5 (Poor) along seven major dimensions: Appearance (e.g., "Does it look chewy?"), Aroma (e.g., "Does it smell home-baked?"), Snap (e.g., "Does it break easily?"), Texture (e.g., "Is it chalky?"), Start (e.g., "Does the flavor develop quickly?"), Flavor (e.g., "Does it have a minty flavor?"), and Overall Rating ("What is the overall rating of this cookie?").[12] At the start of session one, tasters signed a consent form ensuring, for example, that they were aware of allergens (e.g., some cookies had nuts), and they answered a short demographic questionnaire.[13]

To allow for variability in outcomes, tasters were fairly unconstrained as to how many cookies they could taste and how much time they spent tasting. They could try up to 70 cookies per session in up to three hours (due to site availability).

[10] It is also possible that tasters disliked tasting cookies but still selected into the task because the thank-you cookies conveyed other benefits, such as being used as a gift to a friend or relative. The results we show later document that this scenario seems unlikely: tasters engaged substantially with the task (e.g., those in the Control spent, on average, 1 hour and 19 minutes tasting cookies), despite the fact that the thank-you cookies were not contingent on the time they spent tasting or the amount of cookies they tasted.
[11] Source: there were no listings on eBay for the whole of North America, accessed on January 2012, for the resale of the type of thank-you cookies offered in our test.
[12] See Appendix G for the cookie evaluation sheet.
[13] See Appendix H for the protocol.
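For readers who want a concrete picture of the data each tasting generates, a minimal sketch of one evaluation record is below. The actual sheet is in Appendix G; the field names and validation here are our own assumptions, not the authors' instrument.

```python
from dataclasses import dataclass

RATING_SCALE = range(1, 6)  # 1 = Excellent ... 5 = Poor

@dataclass
class CookieEvaluation:
    """One completed evaluation sheet (illustrative field names; see Appendix G)."""
    taster_id: str
    session: int      # 1 or 2
    cookie_id: int    # up to 70 cookies offered per session
    appearance: int
    aroma: int
    snap: int
    texture: int
    start: int
    flavor: int
    overall: int

    def __post_init__(self):
        ratings = (self.appearance, self.aroma, self.snap, self.texture,
                   self.start, self.flavor, self.overall)
        if not all(r in RATING_SCALE for r in ratings):
            raise ValueError("each rating must be an integer from 1 (Excellent) to 5 (Poor)")
```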

To minimize satiation and fatigue, two potential confounds with crowding out, the two tasting sessions were scheduled one week apart and tasters were informed they could just take a bite of each cookie. Thus, if subjects became satiated from tasting and/or fatigued with the physical or mental effort of filling out the evaluations in the first session, for example, they had one week to recover. They were also informed that they could merely partially eat each cookie: take a bite and leave the uneaten portion in the tasting cup. The on-site research assistant said this, and the information also appeared in writing on each evaluation sheet. Limiting satiation and fatigue was important because they could cause a second-period effort shortfall following a first-period effort increase induced by the reward. There is debate on the extent to which satiation or fatigue are confounded with the crowding out of enjoyment (Cameron and Pierce (1994) and Deci, Koestner, and Ryan (1999)) because non-reward sessions have immediately followed reward sessions in all the studies with adults (Deci, Koestner, and Ryan (1999), page 650). Further, because declining marginal utility from eating cookies could also be confounded with crowding out, we offered non-overlapping sets of 70 cookies in the first and second sessions.

We also ensured that differences in outcomes across the three conditions could not result from unobserved differences in cookies or in research assistants. Within each leg, campus and session, all subjects in all three groups were given the same set of 70 cookies. All subjects also interacted with the same research assistant, who was blind to the research hypothesis. Tasters also worked alone and had no contact with other tasters to avoid peer effects on outcomes, another potential confound (e.g., Mas and Moretti (2009)).[14]

(3) Treatments. After recruiting, tasters were randomly assigned, without their knowledge, to three groups: Control, Unanticipated and Anticipated. Those in the Control performed as agreed upon recruitment. They came to their assigned rooms for the two sessions, tasted and evaluated cookies, and got thank-you cookies at the end of session two. The Control thus established baseline outcomes in the absence of incentive pay. Those in the Unanticipated condition were surprised, at the start of session one, immediately before beginning tasting, with the information that they would get $0.75 per cookie tasted and evaluated. There was no mention of pay for session two. One week later, at the start of session two, they were informed that they would not be paid the piece rate.[15] This setup replicates that in the typical two-period design in psychology, in which a first-session surprise payment is withdrawn in the next session.

[14] To ensure tasters had no contact with each other, tasters came at separate times to the tasting site, tasted in a room with the door closed, and entered and exited the work site through different doors.
[15] We offered no cover story for the withdrawal of the payment as this would entail deception.

Crowding out can thus undermine performance in this condition on one or two margins: (i) tasting or (ii) tasting and evaluating. If tasters like tasting but dislike evaluating, the per-cookie piece rate undermines output or productivity by eroding the marginal intrinsic utility in tasting while leaving the marginal cost of effort from evaluating unchanged. If tasters like both tasting and evaluating, the per-cookie piece rate undermines output or productivity by eroding the marginal intrinsic utility in both tasting and evaluating, leaving the marginal cost of effort associated with other aspects of the task (e.g., the physical effort of holding the pen) unchanged. It is important to note that even enjoyable activities must have effort costs, otherwise agents' effort would be unbounded (see the model in Appendix Section E).

Potential second-period shortfalls in performance for surprised subjects could result, however, not only from incentive pay diminishing enjoyment, but also from unfulfilled pay expectations. To investigate the role of unmet wage expectations in dampening performance we ran a no-surprises Anticipated treatment. Tasters randomly assigned to the Anticipated condition were informed, by telephone call one week before the first session, that they would get $0.75 per cookie tasted and evaluated in the first session but not in the second. Importantly, the information about the payment structure was offered after recruiting and random assignment were completed, so that enrollment could not be due to the incentive scheme (see Appendix H for the protocol for each treatment). Thus, in this treatment agents were not surprised with pay in either session one or two and thus they should behave according to a standard model (unless there is crowding out), as shown in the model in Appendix E. This idea builds on expectations-based reference-dependent preferences (Kőszegi and Rabin (2006, 2007)) wherein, when there are no deviations from expectations, agents behave as consumption utility maximizers, i.e., as standard agents, in our case. As we show later, agents in this no-surprises environment did indeed behave in keeping with a standard model in both sessions and on all measures.

Relatedly, though this treatment would have been more closely comparable to the Unanticipated treatment if subjects were surprised in the first session with pay and then informed, sometime before the second session, that they would not be paid, we felt this option would have had a crucial drawback: it could increase the likelihood that agents would still expect a reward in the second session. Namely, agents could think "I was not expecting pay in session one but the principal pleasantly surprised me with pay; now the principal tells me she will not pay me in session two, so I should not expect to be paid again, but it is possible she might pleasantly surprise me again as she did in session one."

And thus agents could have come to the second session with some expectation of being pleasantly surprised with pay. We worried that this expectation, if not realized, could have depressed effort and be again confounded with crowding out. We thus hoped that giving advance warning of the pay scheme and following through in session one, paying the piece rate as promised and thus not surprising agents, would help establish the principal as a reliable promise keeper, thus reducing the expectation of a surprise in session two. As a result, the Unanticipated and Anticipated conditions are not directly comparable; however, the latter is still a useful way to test whether one could observe pathological behavior (underperformance in session two) when there are no surprises for the agent.

3 Results

This section documents the results for our three main performance measures, output (number of cookies tasted and evaluated), productivity (minutes per cookie tasted and evaluated) and quits (percentage of tasters who quit after being paid in session one), and discusses whether our findings are more consistent with a standard or with a crowding out model.

The predictions of these models are straightforward. Under a standard model, output and productivity for either treatment exceed those of the Control in session one (agents work more when paid the piece rate) and these outcomes return to the level of the Control in session two, when the piece rate incentive is withdrawn. Under a crowding out model, output and productivity in either treatment may exceed those of the Control in session one (depending on whether the incentive effect of the piece rate dominates over crowding out) but these outcomes are lower than the Control's in session two, after the piece rate is withdrawn, in line with output and productivity findings in prior crowding out research (see propositions 1 and 2 in Appendix E). As for quits in the treatments in session two, these will result from the trade-off between the income effect of the piece rate in session one (leading to higher consumption of leisure, i.e., quits, in session two) and whether there is the expectation of a piece rate in session two (inducing subjects to return). Because the income effect of the piece rate in session one cannot be disentangled from crowding out (both induce subjects to quit in session two), we point out that the quit rates in session two are consistent with a standard model, but could also be consistent with a crowding out one.

Importantly, since the crowding out literature uses time spent on the task most often (see Appendix D), we also describe it briefly. However, as noted, this measure is somewhat uninformative of performance if not paired with other outcomes, such as output.
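As a concrete illustration of how the three measures are constructed from session-level records, the sketch below computes output, minutes per cookie, and quit rates by condition. The toy rows and column names are hypothetical (loosely echoing the averages reported later in Table 2), not the authors' data.

```python
import pandas as pd

# Hypothetical long-format data: one row per taster-session.
df = pd.DataFrame({
    "taster_id": [1, 1, 2, 2, 3],
    "condition": ["Control", "Control", "Unanticipated", "Unanticipated", "Anticipated"],
    "session":   [1, 2, 1, 2, 1],        # taster 3 quit after session one
    "cookies":   [30, 26, 48, 29, 51],   # output: cookies tasted and evaluated
    "minutes":   [79, 62, 102, 56, 112], # time spent on the task
})

# Output and time spent: averages by condition and session.
means = df.groupby(["condition", "session"])[["cookies", "minutes"]].mean()

# Productivity, following the literature's convention of time over output.
df["minutes_per_cookie"] = df["minutes"] / df["cookies"]
productivity = df.groupby(["condition", "session"])["minutes_per_cookie"].mean()

# Quits: share of session-one tasters with no session-two record.
session_one = df[df["session"] == 1]
returned = set(df.loc[df["session"] == 2, "taster_id"])
quit_rate = (
    session_one.assign(quit=~session_one["taster_id"].isin(returned))
    .groupby("condition")["quit"].mean()
)

print(means, productivity, quit_rate, sep="\n\n")
```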

Namely, if one observes that pay reduces time spent on a task, but output does not decline (and hence productivity increases), then it is less clear that pay undermines performance.

3.1 Descriptive Summary and Graphical Evidence

Sample and summary statistics. We recruited 91 participants for the four legs of our experiment on two campuses, A and B. Once recruited, participants were randomly assigned to the three conditions as follows: 37 to the Control, 27 to the Unanticipated and 27 to the Anticipated. Most subjects (76) came from campus A, where the facilities could accommodate more people (see Appendix Table C.7 for campus and treatment breakdown). Most (81) attended session two: 34, 26 and 21 in the Control, Unanticipated, and Anticipated conditions, respectively.[16]

Subjects engaged significantly with the task, though there was substantial variability. Table 1 shows that in the 172 subject-sessions (91 in session one and 81 in session two), tasters tasted an average of 35 cookies (with a minimum of 4 and a maximum of 70); spent an average of 2.6 minutes per cookie; and spent on average 81 minutes tasting (with a minimum of 12 and a maximum of 182).[17] Of the cookies tasted, 70% were partially eaten, indicating that subjects listened to the instructions that they could merely take a bite.

Summary of output and productivity evidence for the first (reward) session. Table 2, column (2) suggests that the piece rate boosted both output and productivity. During the reward session, those in the Unanticipated and Anticipated treatments tasted and rated 60%-70% more cookies than the Control: 48 and 51 versus 30.

[16] Prior to running the full experiment with the 91 subjects, we ran a small pilot with nine subjects assessing their response to the $0.75 piece rate, but using a smaller number of cookies (60 or fewer) and a shorter questionnaire. Because we found that subjects often reached the upper bound of cookies to taste, which could lead to low variability in outcomes in the main experiment, we increased the number of cookies to 70 and extended the questionnaire length. We also dropped two subjects from the Unanticipated treatment because the research assistant gave them the wrong instructions at the beginning of session two (neglected to remind them that they would not be paid, and as a result these two subjects labored under the impression they were going to be paid in session two, as they asked for their piece rate earnings at the end of this session). Our sample size resulted from a compromise between the costs of running our test and what seemed like a reasonable number of subjects given the literature and our pilot. Because there were no studies with a similar task and piece rate to ours, we had no prior data for precise power calculations. However, given that in our survey of 92 studies on crowding out the median condition size was 14 subjects and influential studies on crowding out have had even fewer subjects (e.g., Deci (1971) with 12 or fewer per condition), we thought it reasonable that our larger sample would be able to detect crowding out (we ran rough power analyses confirming this). Further, our small pilot suggested that subjects were very responsive to the primary input, the piece rate, suggesting that we would be able to detect statistically significant responses.
[17] Although productivity is typically the ratio of output to time, to be consistent with the literature and for ease of exposition, we use the ratio of time to output: minutes per cookie tasted and evaluated.

They also tasted and evaluated each cookie at least 28% faster than the Control: in 2.3 minutes versus 3.2. Importantly, these averages were not driven by a few outliers, a potential concern since our samples were not very large, but rather reflect shifts in the whole CDFs of output and productivity versus the CDF of the Control (left panel of Figures 2 and 3). Thus, we had enough power to detect these output and productivity boosts at the 1% level, as shown in the next section.

Summary of quits, output and productivity evidence for the second (non-reward) session. Table 2, columns (1) and (4) show that the quit rate from session one to session two varied by condition. It was 8% in the Control (3 in 37 tasters), which established the baseline quit rate. It was 4% in the Unanticipated (1 in 27) and 22% (6 in 27) in the Anticipated conditions. As shown in Appendix Table A.1, the 4% lower quit rate in the Unanticipated condition relative to the Control was not statistically significant (p-value of 0.46). However, the 22% quit rate in the Anticipated condition (i) was almost three times that in the Control, exceeding the Control's by 14%, which was close to marginally significant (p-value of 0.13), and (ii) led to an 18% higher quit rate versus the Unanticipated condition (p-value of 0.04, row (4)). Taken as a whole, (i) and (ii) indicate that quits were substantially higher in the Anticipated than in either the Control or Unanticipated conditions.

Table 2, column (5), documents that the average output for those in the Unanticipated condition was close to (though slightly higher than) that in the Control: 29 versus 26. Their productivity was higher as well: they tasted and rated each cookie in 2.0 minutes versus 2.8 minutes for the Control, or 29% faster. This pattern in average output and productivity versus the Control also reflects patterns in whole CDFs instead of a few outliers (right panels of Figures 2 and 3). Those in the Anticipated group during the non-reward session also produced slightly more than those in the Control (32 versus 26, respectively) and were similarly productive, tasting and rating each cookie in 2.7 minutes versus 2.8, respectively. Further, output was still slightly higher than the Control's even when counting quitters as producing zero output, the worst-case scenario: 25 versus 23 (column (5), rows (2) and (12), respectively). Once again, the pattern in the averages reflects not a few outliers, but rather the pattern of the whole CDFs of tasters' productivity and output (right panels of Figures 2 and 3).

Brief summary of time spent on the task for session one and session two. Table 2, column (2) shows that subjects in either treatment spent more time on the task than the Control during the first session: 102 and 112 minutes in the Unanticipated and Anticipated conditions, respectively, versus the Control, at 79 minutes.
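The quit-rate contrasts above can be checked with a simple two-by-two test. The sketch below applies Fisher's exact test to the quit counts just reported; this is only an illustrative check, not necessarily the test behind the p-values in Appendix Table A.1.

```python
from scipy.stats import fisher_exact

# Quit counts reported above: Control 3 of 37, Unanticipated 1 of 27, Anticipated 6 of 27.
def quit_table(quits_a, n_a, quits_b, n_b):
    """2x2 table of (quit, returned) counts for two conditions."""
    return [[quits_a, n_a - quits_a], [quits_b, n_b - quits_b]]

comparisons = {
    "Unanticipated vs Control": quit_table(1, 27, 3, 37),
    "Anticipated vs Control": quit_table(6, 27, 3, 37),
    "Anticipated vs Unanticipated": quit_table(6, 27, 1, 27),
}
for label, table in comparisons.items():
    _, p_value = fisher_exact(table, alternative="two-sided")
    print(f"{label}: p = {p_value:.2f}")
```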

Column (5) documents that during the second session, those in the Unanticipated condition spent 6 fewer minutes than the Control on the task (56 versus 62 minutes, respectively). This decrease of 11% in time spent on the task relative to the Control, combined with a 12% higher output than the Control (29 versus 26), approximates the increase in productivity noted above of 29% (of 0.8 minutes over 2.8 minutes in the Control).[18] The 6-minute deficit shrinks to 3 minutes if one considers quitters as supplying zero time (57 versus 54 minutes, in rows (5) and (10), respectively, or an undersupply of 5%). Our undersupply of time on the task for the Unanticipated condition is congruent with prior crowding out studies showing that time spent on the task declines after the unexpected removal of pay (e.g., Deci (1971)), though our effect is somewhat small (and not statistically significant, as we later show). Our smaller effect dovetails with psychology and economics research showing that experimental effect sizes are smaller in replications and in larger samples because of, for example, publication bias (e.g., OpenScienceCollaboration (2015), Camerer et al. (2016)). Indeed, the median sample size in the 92 tests we reviewed in Appendix D was 14 subjects per condition, suggesting that the reported effect sizes are likely substantially inflated.[19]

Those in the Anticipated condition supply 15 more minutes of time in the second session than the Control (77 versus 62 minutes, respectively). This gap remains positive, but shrinks to 3 minutes (60 versus 57 minutes, respectively), even when we consider the worst-case scenario of quitters supplying zero time.[20]

[18] The combination of lower average time spent tasting of 11% versus the Control and higher average output of 12% versus the Control adds to an average excess productivity versus the Control of 23%. This magnitude is similar to, though slightly lower than, the excess productivity of 29% versus the Control when comparing the average of individual productivities. The slight difference is due to rounding (the average output of 29 has been rounded from 29.2 and the average output of 26 has been rounded from 25.5, resulting in an actual 14% average output gain, instead of the stated 12%) and to the fact that the average of ratios is not equal to the ratio of averages.
[19] Publication bias means that published studies often are underpowered to detect the true effect and thus only those studies which, by chance, have substantially large effects are able to reach statistical significance and thus be publishable. Thus subsequent replications will yield no or smaller effect sizes. If the latter, the smaller effect sizes will only be statistically significant with substantially larger samples than those in the original study (see Ioannidis (2008) for more details).
[20] We do not impute the individual productivity (individual minutes spent on the task/individual output) of quitters since they supplied zero time and zero output and 0/0 is undefined.
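Footnote 18's caveat that the average of individual productivities need not equal the ratio of the averages is easy to verify. The two tasters below are hypothetical, chosen only to make the gap obvious.

```python
# Two hypothetical tasters (made-up numbers, only to illustrate footnote 18's point).
minutes = [40, 80]
cookies = [10, 50]

average_of_ratios = sum(m / c for m, c in zip(minutes, cookies)) / len(minutes)
ratio_of_averages = (sum(minutes) / len(minutes)) / (sum(cookies) / len(cookies))

print(average_of_ratios)  # (4.0 + 1.6) / 2 = 2.8 minutes per cookie
print(ratio_of_averages)  # 60 / 30 = 2.0 minutes per cookie
```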

Evidence for engagement with the task. One concern was that subjects might have self-selected into the blind-tasting disliking tasting (or tasting and evaluating) cookies, and only did so because of the benefits that the thank-you cookies conveyed: even though they were difficult to resell and thus to exchange for money, they could be used as a gift to a friend or relative, for example. The evidence that subjects in the Control spent an ample amount of time tasting cookies (one hour and 19 minutes in session one and one hour and two minutes in session two) and tasted a significant amount (an average of 30 cookies in session one and of 26 in session two), despite the thank-you cookies not being contingent on the time spent tasting or the number of cookies tasted, suggests that subjects were captivated with the task. It thus appears unlikely they selected into the task disliking tasting (or both tasting and evaluating) and did so merely for the benefits conveyed by the thank-you cookies.

3.2 Estimation and Consistency with Standard and Crowding-Out Models

Having described output and productivity, as well as time spent on the task, we now test whether these outcomes differ statistically from those in the Control in either session (as for quits, we both described and tested them in the previous section). We show later that these results remain similar despite controlling for time-invariant campus and leg unobservables.

Unadjusted estimation specification. The first specification estimates simple raw means for the different conditions across the sessions. We estimate the outcomes for subject i, in the t1 (Control), t2 (Unanticipated) and t3 (Anticipated) conditions, in campus c, leg l, and session s as follows:

outcome_{i,t,c,l,s} = \alpha_{1,1} + \alpha_{1,2} t_1 s_2 + \sum_{\tau=2}^{3} \sum_{j=1}^{2} \beta_{\tau,j} t_\tau s_j + \epsilon_{i,t,c,l,s}    (1)

where t_\tau and s_j are indicator variables for condition \tau and session j. The parameters of interest are \beta_{\tau,j}, which estimate unadjusted differences in average outcomes between the treatments and the Control per session. For example, \beta_{2,1} identifies the difference between treatment two (Unanticipated) and the Control in session one. The parameter \alpha_{1,1} is the outcome for the baseline category: the Control in session one. Specification 1 simultaneously estimates outcomes for sessions one and two for each condition, with the advantage of allowing clustering of standard errors at the individual level. Clustering takes into account serial correlation in outcomes for each subject across sessions, yielding more conservative standard errors (Bertrand, Duflo, and Mullainathan (2004)).[21]

First (reward) session output and productivity for both treatments. Table 3, Panel A, column (1), shows that the piece rate boosted output by an average 18 and 21 evaluations in the Unanticipated and Anticipated conditions, respectively, vis-à-vis the Control.

[21] Conducting a separate regression for session one and another for session two would yield the same point estimates as those above but would fail to take into account that each subject's performance across the two sessions could be serially correlated, which would bias the standard errors.
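A minimal sketch of how specification (1) could be estimated with standard errors clustered at the taster level is below, assuming a long-format dataframe with one row per subject-session and hypothetical column names (condition, session, taster_id, and an outcome column). It uses a fully interacted dummy parameterization that recovers the same condition-by-session means (the session-two treatment contrasts are the sum of the treatment main effect and its interaction with session two); it is not the authors' code.

```python
import pandas as pd
import statsmodels.formula.api as smf

def estimate_unadjusted(df: pd.DataFrame, outcome: str = "cookies"):
    """OLS with condition-by-session dummies and taster-clustered standard errors.

    Assumes `df` has one row per subject-session with columns `condition`
    ("Control", "Unanticipated", "Anticipated"), `session` (1 or 2),
    `taster_id`, and the outcome column -- hypothetical names.
    """
    model = smf.ols(
        f"{outcome} ~ C(condition, Treatment('Control')) * C(session)",
        data=df,
    )
    # Clustering by taster allows for serial correlation in a subject's
    # outcomes across the two sessions (Bertrand, Duflo, and Mullainathan (2004)).
    return model.fit(cov_type="cluster", cov_kwds={"groups": df["taster_id"]})

# Example usage with a dataframe `df` structured as described above:
# print(estimate_unadjusted(df, outcome="cookies").summary())
```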

These increases were statistically significant at the 1% level. Panel B, column (1) also shows that the piece rate increased productivity: it reduced the average time tasting and evaluating each cookie by 0.9 minutes in the treatments vis-à-vis the Control, and these magnitudes were also statistically significant at the 1% level.

Are the first (reward) session findings consistent with a standard or with a crowding-out model? The increases in output and productivity in the treatments are consistent with a standard model: agents work more when paid more. They could also be consistent with a crowding-out model: the piece rate may have crowded out the intrinsic marginal utility for tasting (or tasting and evaluating), but this effect was dominated by the incentive effect of the piece rate (Proposition 1 in Appendix E).

Second (non-reward) session output and productivity for the Unanticipated treatment. Table 3, Panel A, column (2), shows that those surprised by the withdrawal of the reward tasted slightly more than the Control (an extra four cookies), though this estimate is statistically insignificant. Panel B, column (2), shows that these subjects were also more productive than those in the Control, tasting and evaluating each cookie in 0.8 fewer minutes (48 seconds), a difference significant at the 1% level.[22]

Are the Unanticipated condition findings for the second (non-reward) session consistent with a standard or with a crowding-out model? All findings, except the extra productivity vis-à-vis the Control, are consistent with a standard model. The finding of similar (though slightly higher) output as the Control in the Unanticipated condition is consistent with a standard model: in the absence of the piece rate, output returns to the Control's level as the two groups face the same incentives (both groups' output is mainly driven by the marginal utility of tasting or of tasting and evaluating cookies). However, it is not consistent with the crowding-out model, which predicts lower output than the Control. The finding of excess productivity vis-à-vis the Control is not consistent with either a standard or a crowding out model. A standard model predicts a decline in productivity to the level of the Control in the absence of the piece rate. A crowding-out model predicts lower productivity than the Control given the undermining of marginal utility by the piece rate (Proposition 2 in Appendix E). We later discuss this finding in more detail.

[22] The excess of four cookies in the unadjusted specification, instead of the expected three evaluations shown in the raw means in Table 2, column (5), is due to rounding: the average evaluations in the Control and Unanticipated conditions were 25.5 and 29.2, respectively. These numbers round to 26 and 29, respectively, but their difference (3.7) rounds to 4.

is consistent with a standard model: the expectation of a piece rate leads subjects to attend the second session despite the income effect from the piece rate in session one. However, it could also be consistent with a crowding-out model: the piece rate in session one eroded interest in the task, but the incentive effect of the piece rate once again dominated the crowding out, curbing quits.

Second (non-reward) session output and productivity for the Anticipated treatment. Table 3, Panel A, column (2), shows that those who came to the second session (78% of tasters) tasted more than those in the Control (an extra six cookies), though this difference is statistically insignificant. However, even considering quitters as producing zero output, the average output in the Anticipated condition still exceeds the Control's by one cookie (not statistically significant, as shown in Appendix Table A.2). Panel B, column (2), shows that productivity for these tasters was similar to the Control's in the second session: they spent, on average, a statistically insignificant 0.1 fewer minutes (6 seconds) per evaluation than the Control.

Are the Anticipated condition findings for the non-reward session consistent with a standard or with a crowding-out model? All findings are consistent with a standard model. First consider quits. The 14 percentage point higher quit rate vis-à-vis the Control (22% versus 8%) is consistent with a standard model: the income effect of the piece rate increased the consumption of leisure in the second session, and the expectation of the piece rate is no longer present to induce subjects to return. It could also be consistent with crowding out: the piece rate in session one might have eroded interest in the task, and the expectation of the piece rate is also no longer present to induce subjects to return. The finding of slightly higher (but not significant) output than the Control's, even when incorporating quits, is consistent with a standard model but not with a crowding-out one. The finding of productivity similar to the Control's for the 78% of non-quitters is also consistent with a standard model: conditional on returning, tasters in the Anticipated treatment display the same productivity as those in the Control, as there is no longer the piece-rate incentive to boost their productivity. However, it does not rule out a crowding-out model: for example, had the 22% of quitters come to the second session, they could have had very low productivity (been extremely slow at tasting and rating cookies), dragging average productivity below that of the Control, consistent with crowding out.

Brief note on the results for time spent on the task. Appendix Table A.3, column (1), shows that subjects in the Unanticipated and Anticipated conditions supplied more

time than the Control in session one (23 and 32 more minutes) and that these effects are statistically significant at the 5% and 1% levels, respectively. Columns (2) and (3) show that during session two, however, only those in the Unanticipated condition spent less time on the task (6 minutes less than the Control, at 62 minutes), even when counting quitters as supplying zero time (3 minutes less than the Control, at 57 minutes). These results are qualitatively consistent with studies documenting crowding out via the undersupply of time; however, they are small in magnitude and not statistically significant. Those in the Anticipated condition spent more time on the task than the Control, even when counting quitters as supplying zero time (an excess of 3 minutes, not statistically significant). These results do not change when we control for time-invariant campus, leg, and session heterogeneity, using the specification below.

Summary of findings on the three main measures. The findings on average output and productivity in both treatments and sessions are consistent with a standard model, except for the excess average productivity in the Unanticipated condition in the second session, which is also inconsistent with crowding out. Further, because differences in means in output and productivity reflect patterns in whole distributions, the evidence in favor of a standard model is unlikely to be due to a few outliers or to insufficient power to detect output and productivity evidence in favor of crowding out. These findings do not change for the Anticipated condition even when taking into account its higher rate of quits in session two: (i) output in this condition is only consistent with a standard model, even when considering quitters as producing zero output in session two (the worst case), and (ii) the fact that productivity in this condition in session two could be consistent with crowding out if we assume that quitters would have had extremely low productivity (in the limit, spending infinite time per cookie tasted and evaluated) does not invalidate the fact that the observed productivity is consistent with a standard model. As for the evidence on quit rates, it is also consistent with a standard model. Thus, on these three measures, we find no evidence that is consistent with crowding out but not with a standard model; rather, we find the opposite.

3.3 Estimation: Adjusted Specification

We now add time-invariant unobserved campus, leg, and session factors to specification (1) and show that the previous results on averages do not change (the results are generally stable across campuses and legs, as shown in Appendix Table C.8). We add these factors because we randomized tasters into each condition within each leg and for each of the campuses (for

example, for leg two, we randomized those in campus A into the three conditions and did the same for campus B), and thus these factors could have biased our estimates. We estimate the outcomes for rater i, in the t1 (Control), t2 (Unanticipated), and t3 (Anticipated) conditions in campus c, leg l, and session s as follows:

outcome_{i,t,c,l,s} = \alpha_{1,1} + \alpha_{1,2} t_1 s_2 + \sum_{\tau=2}^{3} \sum_{j=1}^{2} \beta_{\tau,j} t_\tau s_j + \lambda_c \times \lambda_l \times \lambda_s + \epsilon_{i,t,c,l,s}   (2)

The interaction of campus, leg, and session fixed effects (\lambda_c \times \lambda_l \times \lambda_s) conservatively captures unobservable time-invariant campus, leg, and session determinants of outcomes. Campus fixed effects control, for example, for unobserved heterogeneity in health consciousness by campus, which could influence the response to pay within campus and thus differences in outcomes between the treatments and the Control (e.g., tasters on a more health-conscious campus not increasing consumption and thus not producing more evaluations in response to the piece rate). Leg fixed effects control, for example, for unobserved temperature, which could also affect the response to incentives (e.g., cookies may be less appealing in a leg occurring in a hot month). The interaction conservatively captures these effects on outcomes within a given campus, leg, and session.23

23 This interaction is conservative in that it subsumes stand-alone campus, leg, or session fixed effects and their two-way interactions.

The causal parameters of interest are the \beta_{\tau,j}, pooling the differences in outcomes between the treatments and the Control for each session, but now within a campus and leg. The parameter \alpha_{1,1} is the outcome for the baseline category in session one, which cannot be separately identified from the interaction of the fixed effects, as usual.

Table 3, columns (3) and (4), in Panels A and B, shows that the previous estimates on output and productivity and their statistical significance remain similar after adjusting for the above-described factors. Appendix Table A.3, columns (4)-(6), shows that the estimates on time on the task also remain similar after these adjustments.
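As a concrete illustration, here is a minimal sketch, under the same hypothetical column names as before, of how specification (2) might be estimated. In this simplified version the Control-by-session terms are absorbed by the campus-by-leg-by-session fixed effects, so each treatment dummy captures the treatment-versus-Control contrast for that session within campus, leg, and session.

```python
# Sketch of specification (2): treatment-by-session dummies plus campus x leg x session
# interaction fixed effects, standard errors clustered by subject. Names are assumed.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("tasting_panel.csv")
df["cls"] = (df["campus"].astype(str) + "_" + df["leg"].astype(str)
             + "_s" + df["session"].astype(str))
for cond in ["Unanticipated", "Anticipated"]:
    for s in (1, 2):
        df[f"{cond.lower()}_s{s}"] = ((df["condition"] == cond) & (df["session"] == s)).astype(int)

model = smf.ols(
    "output ~ unanticipated_s1 + unanticipated_s2 + anticipated_s1 + anticipated_s2 + C(cls)",
    data=df,
)
fit = model.fit(cov_type="cluster", cov_kwds={"groups": df["subject_id"]})
print(fit.params.filter(like="anticipated"))   # the four treatment-by-session coefficients
```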

3.4 The role of learning on productivity after the surprising withdrawal of pay

All our findings across all conditions and sessions are consistent with a standard model except for one: the substantially higher productivity after the unexpected withdrawal of the piece rate. As a robustness check, we explore whether the boost in productivity caused by the piece rate in the first session led to a permanent increase in productivity through learning, which would then persist even in the absence of pay. Though the Control controls for such learning, it is still possible that working at high levels of productivity in the first session could have led tasters in the Unanticipated and Anticipated conditions to reach a permanently higher productivity threshold. Three pieces of evidence suggest this was not the case.

First, had the piece rate induced a permanent increase in productivity through learning that withstood the absence of pay, we should have observed a similar effect for the 78% of non-quitters in the Anticipated condition, and we did not. Table 4, Panel B, columns (1a) and (3a), row (4), documents that the 78% of non-quitters in the Anticipated treatment showed average productivity similar to that of Unanticipated subjects during the first session: they tasted and rated each cookie 0.9 and 0.8 minutes faster, respectively, than Control subjects (the p-value for the difference varies with the specification). Further, this similarity in average productivity is again not due to a few outliers but to shifts in the whole distribution of tasters' outcomes (bottom left panel in Figure 4). However, in the absence of the piece rate, these two groups behaved very differently: whereas productivity in the Unanticipated condition still exceeded that in the Control, productivity for non-quitters in the Anticipated condition declined to the level of the Control, as one would expect in the absence of pay.

Second, non-quitters in the Anticipated treatment turned in slightly more evaluations than those in the Unanticipated treatment during the reward session, suggesting they could have had even more experience with tasting and evaluating, though this difference is not statistically significant (p-values, which vary with the specification, are shown in Table 4, Panel A, columns (1a) and (3a), row (4)). But only Unanticipated subjects continued working faster than those in the Control in the subsequent non-reward session.

Third, quitters in the Anticipated condition were not faster raters than non-quitters. It could be that the faster raters in the first (reward) session of the Anticipated condition were the ones who quit, and that is why the average productivity in the second session for non-quitters is lower, at the level of the Control. However, Appendix Table A.4 shows that during the first session quitters and non-quitters displayed the same productivity (2.3 minutes/evaluation, p-value=0.64) and the same output (51 cookies, p-value=0.90). Learning, therefore, is unlikely to account for the extra productivity of those experiencing the unexpected withdrawal of the reward.

4 Results on Quality

Because learning is unlikely to explain the excess productivity, we speculated that it could be driven by individuals displeased upon learning they were not going to be paid, tasting

their cookies (generating the same output as the Control) but decamping from the site as soon as possible by shirking on the care with which they filled out the evaluations. This would be consistent with, for example, Mas (2006) and Kube, Maréchal, and Puppe (2013), who find that when wage expectations are not met, workers shirk on one or several dimensions of the task (e.g., output or quality of the output), as they may, for example, retaliate or lose morale (e.g., Bewley (1999)). For example, Mas (2006) documented that when policemen fail to receive expected higher wages disputed under arbitration, their performance declines to levels below those prior to arbitration; Kube, Maréchal, and Puppe (2013) showed that when workers hired for a forecasted wage were told, upon arriving at the work site, that they would be paid less, they underperformed vis-à-vis a control group.

We find suggestive evidence that workers appear to shirk on the quality of the evaluations when surprised by the withdrawal of the piece rate. One qualification on the analysis that follows, however, is that assessing the quality of subjective ratings is less straightforward than measuring output, productivity, and quits, and thus requires appropriate caveats.

Quality measure. Because it is hard to assess quality in surveys where the responses are subjective (there is no right or wrong answer), quality is often analyzed through dispersion: respondents economize on effort mainly by answering more questions at random (increasing dispersion) or by straightlining, that is, giving the same answers/ratings over consecutive questions (displaying little to no dispersion), according to the large literature on survey implementation in psychology and in education (e.g., Krosnick (1991, 1999)). Since ratings were subjective, we used the dispersion in ratings for a fixed cookie to assess rating quality. Because the research assistant who perused the evaluations before paying the piece rate or giving away the thank-you cookies could easily detect straightlining, it was likely that subjects would economize on effort via an increase in dispersion. Indeed, straightlining (e.g., giving a 2 on all dimensions) occurred rarely in our data (only 3.6% of the total number of evaluations had the same rating on all dimensions). Further, as we document in Appendix B, the speed in completing evaluations is correlated with higher dispersion in our task, not lower, suggesting agents chose the increased-dispersion option.24

24 Two pieces of evidence support the idea that increased dispersion in ratings is a sign of lower rather than higher effort. First, shirking by reducing the variance in ratings is more easily detected by the research assistant than shirking by increasing the variance. Specifically, conditional on these two strategies leading to a similar economy of effort, 2, 2, 2, 2... or 2, 1, 1, 2... sequences, for instance, are easier to perceive in a visual inspection than a 5, 2, 4, 1,... sequence (consistent with only 3.6% of the total number of evaluations having the same rating on all dimensions). Second, if a faster completion of evaluations were correlated with more carelessness via little dispersion, then we should observe a positive correlation between the time spent per evaluation and dispersion: the less time spent filling in each evaluation, the more the straightlining and thus the lower the dispersion. Appendix B shows, however, that we observe an inverse correlation: the less time spent filling in each evaluation, the higher the dispersion, all else equal. Thus, it appears that subjects chose the increased-dispersion option to economize on effort.

Our measure of dispersion is the standard deviation in the ratings of a cookie.25 For each evaluation, we computed the standard deviation in ratings across the seven dimensions: Appearance, Aroma, Snap, Texture, Start, Flavor, and Overall Rating. The scale for each dimension ran from 1 (Excellent) to 5 (Poor). Therefore, a cookie rated 1, 3, 4, 2, 5, 5, 3 had a standard deviation of 1.5, whereas a cookie rated 2, 2, 2, 2, 2, 2, 2 had a standard deviation of 0. However, tasters rarely awarded the same rating to each dimension.

25 The use of other measures, such as the variance and the range, yielded qualitatively similar results.

Sample. The goal was to assess whether the dispersion in ratings for a fixed cookie (e.g., an Oreo Chocolate) could differ across the three conditions. It is important to hold the cookie fixed across the three conditions to ensure that differences in dispersion are not due to differences in the cookies themselves. Thus, our analysis focuses on cookies that were sampled in the three treatments at a given campus, leg, and session, so that we can compare that cookie's dispersion in ratings across the three conditions within that campus, leg, and session. Although all subjects within a campus, leg, and session received the same 70 cookies, they did not all taste exactly the same cookies. Subjects tasted different numbers of cookies and, within those, some cookies were more likely to be tasted, as they had been randomly placed in trays closer to a subject. Further, to focus on dispersion differences within the same cookie, we excluded cookies for which the same identification number could refer to more than one type of cookie, as is the case in cookie assortments.26,27

26 Thus the analysis drops 5 and 12 subjects in sessions one and two, respectively.

27 As a result, extrapolating the dispersion analysis below on this more restricted sample to the full sample relies on the assumption that, had these tasters not been dropped (because of not tasting the same cookies as other tasters or of sampling more cookie assortments), our dispersion results would have been similar. Though we naturally cannot test whether dropping these subjects biases the dispersion results, we can do so for output and productivity. Appendix Table A.5 shows that output and productivity differences across conditions and sessions on the restricted and full samples are similar, suggesting the exclusion of these subjects was orthogonal to output and productivity differences. Thus the exclusion of these tasters might have also been orthogonal to differences in dispersion for the same cookie: the dispersion results we document below for a fixed cookie across sessions and conditions in the restricted sample might not have been too dissimilar from those in the full sample, had we been able to observe them. Nonetheless, an extrapolation to the full sample should be viewed with this caveat.
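To make the dispersion measure concrete, the snippet below, a minimal sketch rather than the authors' code, reproduces the two worked examples in the text; the 1.5 figure corresponds to the sample standard deviation (dividing by n - 1).

```python
# Dispersion of one evaluation: the standard deviation of its ratings across
# the seven dimensions (1 = Excellent, 5 = Poor). Reproduces the examples above.
from statistics import stdev

DIMENSIONS = ["Appearance", "Aroma", "Snap", "Texture", "Start", "Flavor", "Overall Rating"]

def rating_dispersion(ratings):
    """Sample standard deviation of a single evaluation's seven ratings."""
    assert len(ratings) == len(DIMENSIONS)
    return stdev(ratings)

print(round(rating_dispersion([1, 3, 4, 2, 5, 5, 3]), 1))  # 1.5
print(rating_dispersion([2, 2, 2, 2, 2, 2, 2]))            # 0.0 (a straightlined evaluation)
```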

Graphical evidence per condition and session. The top left panel in Figure 5, depicting the empirical CDFs of the standard deviation in ratings for an evaluation during the reward session, without holding a specific cookie fixed, shows that for the most part the CDFs overlap or lie close to each other. However, during the non-reward session, the CDF for tasters surprised by the withdrawal of pay (Unanticipated condition) stochastically dominates those for the Control and Anticipated conditions, suggesting that the Unanticipated condition's higher average standard deviation compared to both others was driven by a shift in the whole CDF.

Estimation specification. Beyond documenting unadjusted average differences in the dispersion of ratings per evaluation across conditions, resulting from the above-described CDFs, we also document how average dispersion in ratings for the same cookie, tasted within a given campus, leg, and session, varies across conditions. To do so, we estimated the dispersion in the ratings of a fixed cookie k, for subject i, in the t1 (Control), t2 (Unanticipated), and t3 (Anticipated) conditions at campus c, leg l, and session s as follows:

dispersion_{k,i,t,c,l,s} = \alpha_{1,1} + \alpha_{1,2} t_1 s_2 + \sum_{\tau=2}^{3} \sum_{j=1}^{2} \beta_{\tau,j} t_\tau s_j + \lambda_c \times \lambda_l \times \lambda_s \times \lambda_k + \epsilon_{k,i,t,c,l,s}   (3)

The interaction of campus, leg, session, and cookie fixed effects (\lambda_c \times \lambda_l \times \lambda_s \times \lambda_k) conservatively captures unobserved time-invariant campus, leg, session, and cookie determinants of the standard deviation. Campus fixed effects control, for example, for whether subjects on one campus are less conscientious than those on the other, thus showing a higher propensity to respond at random, which affects the differences in dispersion between the treatments and the Control within campuses. Leg fixed effects address, for example, whether during a hot summer leg some cookies have a higher tendency to melt, leading to more dispersion in their ratings for this leg than for others (e.g., their Appearance rating would be worse but their Flavor rating would remain unchanged). Cookie fixed effects control for unobserved time-invariant differences in cookie characteristics that could lead to differences in dispersion, such as a cookie's having a good appearance but a bad flavor. The interaction addresses whether these unobservables could differentially affect outcomes within campus, leg, session, and cookie (e.g., whether a hot summer leg affects the dispersion of a given cookie more on one campus than on the other).

The parameters of interest are the \beta_{\tau,j}, which estimate, for a given campus, leg, and session, how the dispersion for the same cookie differs between the treatments and the Control. For example, \beta_{2,1} identifies the difference in the standard deviation for a cookie in treatment two in session one versus the standard deviation for that same cookie in the Control in session one, by pooling all these differences within each campus and leg for this fixed cookie.
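The estimation follows the same pattern as the sketch given for specification (2), with the fixed-effect group extended to include the cookie and with one row per evaluation; as before, this is only an illustrative sketch and the file name and column names are hypothetical.

```python
# Sketch of specification (3): evaluation-level dispersion regressed on treatment-by-session
# dummies plus campus x leg x session x cookie fixed effects, clustered by subject.
import pandas as pd
import statsmodels.formula.api as smf

ev = pd.read_csv("evaluations.csv")   # one row per evaluation (assumed layout)
ev["clsk"] = (ev["campus"].astype(str) + "_" + ev["leg"].astype(str)
              + "_s" + ev["session"].astype(str) + "_c" + ev["cookie_id"].astype(str))
for cond in ["Unanticipated", "Anticipated"]:
    for s in (1, 2):
        ev[f"{cond.lower()}_s{s}"] = ((ev["condition"] == cond) & (ev["session"] == s)).astype(int)

model = smf.ols(
    "dispersion ~ unanticipated_s1 + unanticipated_s2 + anticipated_s1 + anticipated_s2 + C(clsk)",
    data=ev,
)
fit = model.fit(cov_type="cluster", cov_kwds={"groups": ev["subject_id"]})
print(fit.params.filter(like="anticipated"))
```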

As usual, \alpha_{1,1} is the baseline category: the outcome for the Control in session one, which cannot be separately identified from the fixed effects. Finally, we conservatively cluster the standard errors by individual to address the potential correlation in ratings dispersion for a given subject within a session (because each individual produced several evaluations in each session) and across sessions (because dispersion in filling out the evaluations is likely correlated across sessions for the same individual) (Bertrand, Duflo, and Mullainathan (2004)).

Ratings standard deviation during the first (reward) session. Table 5 shows that the standard deviation of the ratings during the reward period is not statistically different between the treatments and the Control and is similar between the treatments. As shown in column (3), the standard deviation for the same cookie rated at a given campus and leg during the reward session for the Unanticipated and Anticipated conditions is only 0.03 and 0.01 higher, respectively, than that in the Control and is not statistically significant. Further, these estimates are similar to the unadjusted estimates in column (1) (for reference, the Control's unadjusted average was 0.74). These findings suggest that though the piece rate induced tasters to work faster during the reward session, as shown previously, it did not damage the ratings dispersion, our measure of quality. Further, row (4) shows that there is no statistically significant difference in dispersion across the two treatments.

Ratings standard deviation during the second (non-reward) session. However, column (4) shows that during the subsequent non-reward period, subjects surprised by the payment's withdrawal increased the dispersion in ratings for the same cookie relative to the Control. The standard deviation for the same cookie rated at a given campus and leg during the non-reward session for surprised tasters was 0.10 higher, more than tripling in magnitude relative to that in the reward session, and significant at the 5% level (the unadjusted difference of 0.08 in column (2) was also significant, albeit at the 10% level). The dispersion in ratings for that same cookie when subjects expected the reward to be withdrawn (Anticipated treatment) is not higher than the Control's. Rather, it is slightly smaller (not statistically significant).

5 Discussion and Conclusion

The goals of this test were, first, to obtain a clearer and more comprehensive view of how incentive pay may undermine performance on enjoyable tasks that benefit primarily agents and principals (and not a third party); and, second, if this undermining were to be observed, to explore the extent to which unmet pay expectations might have contributed to it.

The first goal arose from the fact that there is conflicting evidence on whether pay damages performance on enjoyable tasks and from the concern that this could arise from different studies reporting different outcomes. For example, the psychology meta-analysis by Deci, Koestner, and Ryan (1999) found that, for one sample of studies, the balance of the evidence supported crowding out, whereas the meta-analysis by Cameron, Banko, and Pierce (2001) found that, for another sample of studies, it did not.28 The few studies in economics on whether crowding out harms performance on tasks that primarily benefit agents and principals also reach different conclusions (see footnote 6).

28 These meta-analyses disagreed, among other things, on which studies should be included in their meta-samples; on whether the tasks used in several studies were enjoyable, so that incentives had scope to damage performance; and on whether effect sizes were properly computed.

Part of the reason for the conflicting evidence could be the lack of comprehensive reporting of outcomes. Crowding-out studies use many measures, such as time spent on the task (the most used), output, productivity, and willingness to supply future work, among others, but the typical experiment only reports one or two performance-related outcomes, such as productivity, output, or quits. Notably, to our knowledge, no prior experiment has jointly reported output, productivity, and time spent on the task, which is important. First, these interconnected measures (productivity results from the ratio of output to time) may yield conflicting evidence: for example, a reduction in time spent on the task (consistent with crowding out) that leaves output unchanged (inconsistent with crowding out) increases productivity (also inconsistent with crowding out).29,30 Second, reporting these three measures may render the interpretation of whether pay harms performance more nuanced and difficult: how should we interpret, in the context of crowding out, a reduction of time spent on the task that leaves output unchanged and boosts productivity? Because whether one documents crowding out or not depends on which outcomes are reported, and different studies report different outcomes, this could have contributed to the conflicting evidence.

29 None of the experiments in the 92 studies we reviewed in psychology in Appendix D and in footnote 6, and none of the economics experiments reviewed in footnote 6, jointly report these three outcomes. Studies in psychology often report other metrics, but these do not measure performance per se. Rather, they attempt to gauge subjects' feelings for the activity, such as their self-perceived liking or competence for the task, in order to uncover the cognitive mechanisms that underpin subjects' behavior.

30 There are other combinations of time on the task, output, and productivity that could yield contradictory evidence. For example, one might find that pay reduced productivity (consistent with crowding out) due to an increase of time spent on the task (inconsistent with crowding out) while leaving output unchanged (inconsistent with crowding out). Or that pay reduced output and time on the task (both consistent with crowding out), but time decreased faster than output, so that productivity increased (inconsistent with crowding out).

This selective reporting of outcomes fuels a further concern regarding the robustness of the evidence on crowding out: that this evidence may be populated with false positives. This is because recent work has documented that one of the causes of the proliferation of false positives in psychology is researchers' choice of which outcomes to disclose (Simmons, Nelson, and Simonsohn (2011), Simonsohn, Nelson, and Simmons (2014a), Simonsohn, Nelson, and Simmons (2014b)).31

31 A replication study of experimental tests in economics found larger replication rates (statistically significant effects in the same direction as in the original studies in 61% of cases), but also that effect sizes tended to be inflated (Camerer et al. (2016)).

Further adding to the discussion on the robustness of the crowding-out evidence are two recently documented facts concerning replications of laboratory tests in psychology: (i) only 36% of the effects were statistically significant and in the same direction as those in the original studies, and (ii) effect sizes were generally inflated (Open Science Collaboration (2015)). This is partially due to publication bias: tests are often underpowered to detect the true effect, and thus only those studies which, by chance, have substantially large effects are able to reach statistical significance and thus be published. As a result, subsequent replications yield no or smaller effect sizes (e.g., Ioannidis (2008)). The crowding-out literature appears to share this issue of underpowered samples: of the 92 studies we reviewed, the median sample size per condition was 14 subjects (see row 93 of Appendix D, which reviews these studies).

Given these issues, we first aimed to be as transparent as possible by reporting several performance measures of economic interest for a principal, such as output, productivity, and quits, where the latter leverages our two-period design. We also report time spent on the task: although it is a more fragile indicator of performance, it is the most used metric in crowding-out tests and it complements our output and productivity analyses. Relatedly, we also targeted a larger sample size than is typical in this literature to analyze these outcomes. Second, we aimed to assess, had we observed that pay harmed performance, the potential role of unmet pay expectations.

We find that across both treatments and both sessions the results on output, productivity, and quits are consistent with a standard model, except for one: productivity is substantially higher than the Control's after the unexpected withdrawal of the reward. Importantly, these results are unlikely to be due to low statistical power.

The finding of higher productivity than the Control's after the unexpected withdrawal of pay is puzzling and inconsistent with either a standard or a crowding-out model. As it seemed odd that the surprising withdrawal of pay would boost productivity (via a slightly

higher number of cookies tasted and evaluated, but in less time), we speculated that this boost might entail hidden costs for the principal: that tasters might be cutting corners on filling in the evaluations in order to leave the work site earlier. The idea that the unexpected withdrawal of pay could lead workers to shirk on one or more dimensions of the task, such as output or quality, stemmed from prior research (e.g., Bewley (1999), Mas (2006), Kube, Maréchal, and Puppe (2013)). Our more tentative evidence on the quality of the evaluations (measured by an increase in the randomness of the ratings) supports the idea that quality is lower than that in the Control when the piece rate is unexpectedly withdrawn in session two. And this quality deficit only occurs in the Unanticipated condition in session two: we find no such shortfall in other sessions and treatments. Our more speculative evidence on quality could also be consistent with crowding out, but under less plausible assumptions, such as that rating carefully was the only enjoyable component of the task and that pay undermined it.

Although our study does not purport to challenge the whole literature on crowding out, it does point to the importance of reporting several performance measures to assess the effect of pay on enjoyable tasks, in particular if they are interrelated, such as output, productivity, and time spent on the task. This would bring transparency to this research and allow for a richer picture concerning the extent to which pay may undermine performance in these activities. Although our study advances in this direction, there is room for improvement. Our sample sizes, though larger than what is typical in this research, were not very large, as we relied on this prior literature, which finds effects even in small samples, for our power calculations. We nevertheless had enough power to document statistically significant responses to incentives and their removal. Further, we also disclose several performance measures, even those with the potential to yield conflicting evidence. Future studies should aim not only to report a variety of outcomes, in particular if they are interrelated and have the potential to yield conflicting evidence, but also to have larger samples. This way we can obtain a better-informed view of the extent to which paying agents to perform jobs they enjoy harms their performance.

References

Abeler, J., A. Falk, L. Goette, and D. Huffman (2011): Reference Points and Effort Provision, American Economic Review, 101.

Ariely, D., A. Bracha, and S. Meier (2009): Doing Good or Doing Well? Image Motivation and Monetary Incentives in Behaving Prosocially, American Economic Review, 99(1).

Ashraf, N., O. Bandiera, and B. K. Jack (2014): No Margin, No Mission? A Field Experiment on Incentives for Public Service Delivery, Journal of Public Economics, 120.

Baron, J., and D. Kreps (1999): Strategic Human Resources: Frameworks for General Managers. Wiley, first edition.

Bell, D. (1985): Disappointment in Decision Making Under Uncertainty, Operations Research, 33(1).

Bénabou, R., and J. Tirole (2003): Intrinsic and Extrinsic Motivation, Review of Economic Studies, 70(3).

Bénabou, R., and J. Tirole (2006): Incentives and Prosocial Behavior, American Economic Review, 96(5).

Bertrand, M., E. Duflo, and S. Mullainathan (2004): How Much Should We Trust Differences-In-Differences Estimates?, Quarterly Journal of Economics, 119(1).

Bewley, T. F. (1999): Why Wages Don't Fall During a Recession. Harvard University Press.

Boal, K. B., and L. Cummings (1981): Cognitive Evaluation Theory: An Experimental Test of Processes and Outcomes, Organizational Behavior and Human Performance, 28(3).

Calder, B. J., and B. M. Staw (1975): Interaction of Intrinsic and Extrinsic Motivation: Some Methodological Notes, Journal of Personality and Social Psychology, 31(1).

Camerer, C. F., A. Dreber, E. Forsell, T.-H. Ho, J. Huber, M. Johannesson, M. Kirchler, J. Almenberg, A. Altmejd, T. Chan, E. Heikensten, F. Holzmeister, T. Imai, S. Isaksson, G. Nave, T. Pfeiffer, M. Razen, and H. Wu (2016): Evaluating Replicability of Laboratory Experiments in Economics, Science.

Cameron, J., K. Banko, and W. Pierce (2001): Pervasive Negative Effects of Rewards on Intrinsic Motivation: The Myth Continues, The Behavior Analyst, 24(1).

Cameron, J., and W. D. Pierce (1994): Reinforcement, Reward, and Intrinsic Motivation: A Meta-Analysis, Review of Educational Research, 64(3).

Charness, G., and M. Rabin (2002): Understanding Social Preferences with Simple Tests, Quarterly Journal of Economics, 117(3).

Chetty, R., E. Saez, and L. Sandór (2014): What Policies Increase Prosocial Behavior? An Experiment with Referees at the Journal of Public Economics, NBER Working Paper.

Crawford, V., and J. Meng (2011): New York City Cab Drivers' Labor Supply Revisited: Reference-Dependent Preferences with Rational-Expectations Targets for Hours and Income, American Economic Review, 101(5).

Deci, E. (1971): Effects of Externally Mediated Rewards on Intrinsic Motivation, Journal of Personality and Social Psychology, 18(1).

Deci, E., R. Koestner, and R. Ryan (1999): A Meta-Analytic Review of Experiments Examining the Effects of Extrinsic Rewards on Intrinsic Motivation, Psychological Bulletin, 125.

Deci, E., and R. Ryan (1985): Intrinsic Motivation and Self-Determination in Human Behavior. Springer.

Dufwenberg, M., and G. Kirchsteiger (2004): A Theory of Sequential Reciprocity, Games and Economic Behavior, 47(2).

Ericson, K., and A. Fuster (2011): Expectations as Endowments: Evidence on Reference-Dependent Preferences from Exchange and Valuation Experiments, Quarterly Journal of Economics, 126.

Falk, A., and U. Fischbacher (2006): A Theory of Reciprocity, Games and Economic Behavior, 54(2).

Gächter, S., and A. Falk (2002): Reputation and Reciprocity: Consequences for the Labour Relation, Scandinavian Journal of Economics, 104.

Gibbons, R. (1998): Incentives in Organizations, The Journal of Economic Perspectives, 12(4).

Gill, D., and V. Prowse (2012): A Structural Analysis of Disappointment Aversion in a Real Effort Competition, American Economic Review, 102(1).

Gneezy, U., S. Meier, and P. Rey-Biel (2011): When and Why Incentives (Don't) Work to Modify Behavior, Journal of Economic Perspectives, 25(4).

Gneezy, U., and A. Rustichini (2000): Pay Enough or Don't Pay at All, Quarterly Journal of Economics, 115(3).

Goswami, I., and O. Urminsky (2017): The Dynamic Effect of Incentives on Postreward Task Engagement, Journal of Experimental Psychology: General, 146(1).

Groves, R., S. Presser, and S. Dipko (2004): The Role of Topic Interest in Survey Participation Decisions, Public Opinion Quarterly, 68(1).

Gul, F. (1991): A Theory of Disappointment Aversion, Econometrica, 59(3).

Holland, J., and L. Christian (2009): The Influence of Topic Interest and Interactive Probing on Responses to Open-Ended Questions in Web Surveys, Social Science Computer Review, 27(2).

Hossain, T., and K. K. Li (2014): Crowding Out in the Labor Market: A Prosocial Setting is Necessary, Management Science, 60(5).

Huffman, D., and M. Bognanno (2014): Does Performance Pay Crowd Out Worker Non-Monetary Motivations? Evidence from a Real Work Setting, Working Paper.

Ioannidis, J. P. A. (2008): Why Most Discovered True Associations Are Inflated, Epidemiology, 19.

Kahneman, D., and A. Tversky (1979): Prospect Theory: An Analysis of Decision under Risk, Econometrica, 47(2).

Kamenica, E. (2012): Behavioral Economics and Psychology of Incentives, Annual Review of Economics, 4(1).

Kőszegi, B. (forthcoming): Behavioral Contract Theory, Journal of Economic Perspectives.

Kőszegi, B., and M. Rabin (2006): A Model of Reference-Dependent Preferences, Quarterly Journal of Economics, 121(4).

Kőszegi, B., and M. Rabin (2007): Reference-Dependent Risk Attitudes, American Economic Review, 97(4).

Krosnick, J. A. (1991): Response Strategies for Coping With the Cognitive Demands of Attitude Measures in Surveys, Applied Cognitive Psychology, 5(3).

Krosnick, J. A. (1999): Survey Research, Annual Review of Psychology, 50(1).

Kube, S., M. Maréchal, and C. Puppe (2013): Do Wage Cuts Damage Work Morale? Evidence From a Natural Field Experiment, Journal of the European Economic Association, 11(4).

Lacetera, N., M. Macis, and R. Slonim (2012): Will There Be Blood? Incentives and Displacement Effects in Pro-Social Behavior, American Economic Journal: Economic Policy, 4(1).

Lazear, E. (2000): Performance Pay and Productivity, American Economic Review.

Lazear, E., and M. Gibbs (2014): Personnel Economics in Practice. Wiley, third edition.

Lepper, M., D. Greene, and R. Nisbett (1973): Undermining Children's Intrinsic Interest with Extrinsic Reward: A Test of the Overjustification Hypothesis, Journal of Personality and Social Psychology, 28(1).

Loomes, G., and R. Sugden (1986): Disappointment and Dynamic Consistency in Choice under Uncertainty, Review of Economic Studies, 53(2).

Macera, R., and V. te Velde (2016): On the Power of Gifts.

Mas, A. (2006): Pay, Reference Points, and Police Performance, Quarterly Journal of Economics, 121(3).

Mas, A., and E. Moretti (2009): Peers at Work, American Economic Review, 99(1).

Mellström, C., and M. Johannesson (2008): Crowding Out in Blood Donation: Was Titmuss Right?, Journal of the European Economic Association, 6(4).

Open Science Collaboration (2015): Estimating the Reproducibility of Psychological Science, Science, 349(6251).

Pope, D. G., and M. E. Schweitzer (2011): Is Tiger Woods Loss Averse? Persistent Bias in the Face of Experience, Competition, and High Stakes, American Economic Review, 101.

Prendergast, C. (1999): The Provision of Incentives in Firms, Journal of Economic Literature, 37(1).

Rabin, M. (1993): Incorporating Fairness into Game Theory and Economics, American Economic Review.

Rebitzer, J. B., and L. J. Taylor (2011): Extrinsic Rewards and Intrinsic Motives: Standard and Behavioral Approaches to Agency and Labor Markets, Handbook of Labor Economics, 4.

Ryan, R., and E. Deci (2000): Intrinsic and Extrinsic Motivations: Classic Definitions and New Directions, Contemporary Educational Psychology, 25.

Shalev, J. (2000): Loss Aversion Equilibrium, International Journal of Game Theory, 29(2).

Simmons, J., L. Nelson, and U. Simonsohn (2011): False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant, Psychological Science, 22(11).

Simonsohn, U., L. Nelson, and J. Simmons (2014a): P-Curve: A Key to the File Drawer, Journal of Experimental Psychology: General, 143(2).

Simonsohn, U., L. Nelson, and J. Simmons (2014b): P-Curve and Effect Size: Correcting for Publication Bias Using Only Significant Results, Perspectives on Psychological Science, 9(6).

Figures

Figure 1: Room Layout for Cookie Tasting


More information

TTI Personal Talent Skills Inventory Coaching Report

TTI Personal Talent Skills Inventory Coaching Report TTI Personal Talent Skills Inventory Coaching Report "He who knows others is learned. He who knows himself is wise." Lao Tse Mason Roberts District Manager YMCA 8-1-2008 Copyright 2003-2008. Performance

More information

A Survey of Health and Wellbeing in the Creative Sector

A Survey of Health and Wellbeing in the Creative Sector A Survey of Health and Wellbeing in the Creative Sector Executive Summary This report offers a view of the mental health and wellbeing of those who work in the creative industries, primarily, but not exclusively

More information

GHABP Scoring/ Administration instructions GHABP Complete questionnaire. Buy full version here - for $7.00

GHABP Scoring/ Administration instructions GHABP Complete questionnaire. Buy full version here - for $7.00 This is a Sample version of the Glasgow Hearing Aid Benefit Profile- KIT (GHABP- KIT). The full version of Disability Assessment For Dementia (DAD) comes without sample watermark.. The full complete KIT

More information

Choosing Life: Empowerment, Action, Results! CLEAR Menu Sessions. Substance Use Risk 2: What Are My External Drug and Alcohol Triggers?

Choosing Life: Empowerment, Action, Results! CLEAR Menu Sessions. Substance Use Risk 2: What Are My External Drug and Alcohol Triggers? Choosing Life: Empowerment, Action, Results! CLEAR Menu Sessions Substance Use Risk 2: What Are My External Drug and Alcohol Triggers? This page intentionally left blank. What Are My External Drug and

More information

Heavily concentrated alcohol consumption in India

Heavily concentrated alcohol consumption in India Heavily concentrated alcohol consumption in India Standard drinks per day 0 1 2 3 4 5 5.0 4.9 2.8 2.3 1.4 0.7 India USA Russia Per capita (age 15+) Male drinkers only Source: WHO (2014), WHO (2001), own

More information

Taking Medication is No Fun, and That s a Big Problem

Taking Medication is No Fun, and That s a Big Problem Taking Medication is No Fun, and That s a Big Problem Improving Adherence Requires Addressing Psychological Barriers America s Other Drug Problem L & M LAUNDROMAT Nothing But But The The Tooth Tooth DENTISTRY

More information

Investigative Biology (Advanced Higher)

Investigative Biology (Advanced Higher) Investigative Biology (Advanced Higher) The Mandatory Course key areas are from the Course Assessment Specification. Activities in the Suggested learning activities are not mandatory. This offers examples

More information

Appendix: Instructions for Treatment Index B (Human Opponents, With Recommendations)

Appendix: Instructions for Treatment Index B (Human Opponents, With Recommendations) Appendix: Instructions for Treatment Index B (Human Opponents, With Recommendations) This is an experiment in the economics of strategic decision making. Various agencies have provided funds for this research.

More information

THE UNIVERSITY OF CHICAGO THE DYNAMIC EFFECT OF INCENTIVES ON POST-REWARD TASK ENGAGEMENT A DISSERTATION SUBMITTED TO

THE UNIVERSITY OF CHICAGO THE DYNAMIC EFFECT OF INCENTIVES ON POST-REWARD TASK ENGAGEMENT A DISSERTATION SUBMITTED TO THE UNIVERSITY OF CHICAGO THE DYNAMIC EFFECT OF INCENTIVES ON POST-REWARD TASK ENGAGEMENT A DISSERTATION SUBMITTED TO THE FACULTY OF THE UNIVERSITY OF CHICAGO BOOTH SCHOOL OF BUSINESS IN CANDIDACY FOR

More information

The Logotherapy Evidence Base: A Practitioner s Review. Marshall H. Lewis

The Logotherapy Evidence Base: A Practitioner s Review. Marshall H. Lewis Why A Practitioner s Review? The Logotherapy Evidence Base: A Practitioner s Review Marshall H. Lewis Why do we need a practitioner s review of the logotherapy evidence base? Isn t this something we should

More information

Lecture II: Difference in Difference and Regression Discontinuity

Lecture II: Difference in Difference and Regression Discontinuity Review Lecture II: Difference in Difference and Regression Discontinuity it From Lecture I Causality is difficult to Show from cross sectional observational studies What caused what? X caused Y, Y caused

More information

2 ( ) 5

2 ( ) 5 2 ( ) 5 http://likasuni.tistory.com Back to the English The Meaning of Labor Recently, I spent some time with a former student of mine, Stephen.He worked for a large bank and had been devoting himself

More information

Does momentary accessibility influence metacomprehension judgments? The influence of study judgment lags on accessibility effects

Does momentary accessibility influence metacomprehension judgments? The influence of study judgment lags on accessibility effects Psychonomic Bulletin & Review 26, 13 (1), 6-65 Does momentary accessibility influence metacomprehension judgments? The influence of study judgment lags on accessibility effects JULIE M. C. BAKER and JOHN

More information

Tobacco-Control Policy Workshop:

Tobacco-Control Policy Workshop: Tobacco-Control Policy Workshop: Goal: to introduce Mega-Country leaders to an effective policy framework for tobacco control and to develop skills to promote policy implementation. Objectives: As a result

More information

How to stop Someone who is ADDICTED ENABLING

How to stop Someone who is ADDICTED ENABLING stop ENABLING Table of Contents 2 Are You an Enabler? What if the steps you were taking to help a friend or family member through a problem or crisis were actually the very things hurting them most? And,

More information

Is Leisure Theory Needed For Leisure Studies?

Is Leisure Theory Needed For Leisure Studies? Journal of Leisure Research Copyright 2000 2000, Vol. 32, No. 1, pp. 138-142 National Recreation and Park Association Is Leisure Theory Needed For Leisure Studies? KEYWORDS: Mark S. Searle College of Human

More information

Guilt and Pro-Social Behavior amongst Workers in an Online Labor Market

Guilt and Pro-Social Behavior amongst Workers in an Online Labor Market Guilt and Pro-Social Behavior amongst Workers in an Online Labor Market Dr. Moran Blueshtein Naveen Jindal School of Management University of Texas at Dallas USA Abstract Do workers in online labor markets

More information

Contributions and Beliefs in Liner Public Goods Experiment: Difference between Partners and Strangers Design

Contributions and Beliefs in Liner Public Goods Experiment: Difference between Partners and Strangers Design Working Paper Contributions and Beliefs in Liner Public Goods Experiment: Difference between Partners and Strangers Design Tsuyoshi Nihonsugi 1, 2 1 Research Fellow of the Japan Society for the Promotion

More information

AMT TEAM FIT CASE STUDY

AMT TEAM FIT CASE STUDY AMT TEAM FIT CASE STUDY INTRODUCTION TO AMT TEAM FIT AMT Team Fit is a series of trainer-led small group exercise classes that unite elements of aerobic conditioning, HIIT training, core and stabilisation,

More information

Module 4 Introduction

Module 4 Introduction Module 4 Introduction Recall the Big Picture: We begin a statistical investigation with a research question. The investigation proceeds with the following steps: Produce Data: Determine what to measure,

More information

INTERNAL Evaluation Activity Summary by the evaluator for project use

INTERNAL Evaluation Activity Summary by the evaluator for project use INTERNAL Evaluation Activity Summary by the evaluator for project use Antelope County Tobacco Control Program (ATCP) Activity 3-E-4 Highland Focus Groups with Youth Purpose In order to understand the extent

More information

Causality and Statistical Learning

Causality and Statistical Learning Department of Statistics and Department of Political Science, Columbia University 29 Sept 2012 1. Different questions, different approaches Forward causal inference: What might happen if we do X? Effects

More information

Psychology Research Methods Lab Session Week 10. Survey Design. Due at the Start of Lab: Lab Assignment 3. Rationale for Today s Lab Session

Psychology Research Methods Lab Session Week 10. Survey Design. Due at the Start of Lab: Lab Assignment 3. Rationale for Today s Lab Session Psychology Research Methods Lab Session Week 10 Due at the Start of Lab: Lab Assignment 3 Rationale for Today s Lab Session Survey Design This tutorial supplements your lecture notes on Measurement by

More information

Ackoff Doctoral Student Fellowship: 2013 Application

Ackoff Doctoral Student Fellowship: 2013 Application Ackoff Doctoral Student Fellowship: 2013 Application Emotions as signals in prosocial behavior Doctoral Student Wharton Marketing Department Mailing Address: Suite 700 Jon M. Huntsman Hall 3730 Walnut

More information

Working When No One Is Watching: Motivation, Test Scores, and Economic Success

Working When No One Is Watching: Motivation, Test Scores, and Economic Success Working When No One Is Watching: Motivation, Test Scores, and Economic Success Carmit Segal Department of Economics, University of Zurich, Zurich 8006, Switzerland. carmit.segal@econ.uzh.ch This paper

More information

Scientific Ethics. Modified by Emmanuel and Collin from presentation of Doug Wallace Dalhousie University

Scientific Ethics. Modified by Emmanuel and Collin from presentation of Doug Wallace Dalhousie University Scientific Ethics Modified by Emmanuel and Collin from presentation of Doug Wallace Dalhousie University Outline What is scientific ethics Examples of common misconduct Known rates of misconduct Where

More information

Survey of local governors associations 2014

Survey of local governors associations 2014 Survey of local governors associations 2014 Thank you to the 30 local associations that took part in our survey. This represents just under half of the local associations who are in membership of National

More information

Assignment 2: Experimental Design

Assignment 2: Experimental Design Assignment 2: Experimental Design Description In this assignment, you will follow the steps described in the Step-by-Step Experimental Design lecture to design an experiment. These steps will provide you

More information

RESEARCH Childcare and Early Years Providers Survey Sessional Day Care Providers

RESEARCH Childcare and Early Years Providers Survey Sessional Day Care Providers RESEARCH 2005 Childcare and Early Years Providers Survey Sessional Day Care Providers Sam Clemens, Robert Kinnaird and Anna Ullman BMRB Social Research Research Report RR762 Research Report No 762 2005

More information

In Support of a No-exceptions Truth-telling Policy in Medicine

In Support of a No-exceptions Truth-telling Policy in Medicine In Support of a No-exceptions Truth-telling Policy in Medicine An odd standard has developed regarding doctors responsibility to tell the truth to their patients. Lying, or the act of deliberate deception,

More information

:: Slide 1 :: :: Slide 2 :: :: Slide 3 :: :: Slide 4 :: :: Slide 5 :: :: Slide 6 ::

:: Slide 1 :: :: Slide 2 :: :: Slide 3 :: :: Slide 4 :: :: Slide 5 :: :: Slide 6 :: :: Slide 1 :: :: Slide 2 :: Science stems from the empirical movement and thus observation, as well as measurement and description are crucial. The deterministic assumptions of science lead scientists

More information

Clicker quiz: Should the cocaine trade be legalized? (either answer will tell us if you are here or not) 1. yes 2. no

Clicker quiz: Should the cocaine trade be legalized? (either answer will tell us if you are here or not) 1. yes 2. no Clicker quiz: Should the cocaine trade be legalized? (either answer will tell us if you are here or not) 1. yes 2. no Economic Liberalism Summary: Assumptions: self-interest, rationality, individual freedom

More information

Langer and Rodin Model Answers

Langer and Rodin Model Answers Langer and Rodin Model Answers Aims and Context Changes into old age are physiological (for example hearing and sight impairment, wrinkles, immobility etc.) and psychological (for example a feeling of

More information

Chapter 1 Review Questions

Chapter 1 Review Questions Chapter 1 Review Questions 1.1 Why is the standard economic model a good thing, and why is it a bad thing, in trying to understand economic behavior? A good economic model is simple and yet gives useful

More information

Everyday Problem Solving and Instrumental Activities of Daily Living: Support for Domain Specificity

Everyday Problem Solving and Instrumental Activities of Daily Living: Support for Domain Specificity Behav. Sci. 2013, 3, 170 191; doi:10.3390/bs3010170 Article OPEN ACCESS behavioral sciences ISSN 2076-328X www.mdpi.com/journal/behavsci Everyday Problem Solving and Instrumental Activities of Daily Living:

More information

EXPERIMENTAL ECONOMICS INTRODUCTION. Ernesto Reuben

EXPERIMENTAL ECONOMICS INTRODUCTION. Ernesto Reuben EXPERIMENTAL ECONOMICS INTRODUCTION Ernesto Reuben WHAT IS EXPERIMENTAL ECONOMICS? 2 WHAT IS AN ECONOMICS EXPERIMENT? A method of collecting data in controlled environments with the purpose of furthering

More information

Citation for published version (APA): Ebbes, P. (2004). Latent instrumental variables: a new approach to solve for endogeneity s.n.

Citation for published version (APA): Ebbes, P. (2004). Latent instrumental variables: a new approach to solve for endogeneity s.n. University of Groningen Latent instrumental variables Ebbes, P. IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document

More information

Introduction to Program Evaluation

Introduction to Program Evaluation Introduction to Program Evaluation Nirav Mehta Assistant Professor Economics Department University of Western Ontario January 22, 2014 Mehta (UWO) Program Evaluation January 22, 2014 1 / 28 What is Program

More information

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES 24 MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES In the previous chapter, simple linear regression was used when you have one independent variable and one dependent variable. This chapter

More information

Feldexperimente in der Soziologie

Feldexperimente in der Soziologie Feldexperimente in der Soziologie Einführungsveranstaltung 04.02.2016 Seite 1 Content 1. Field Experiments 2. What Do Laboratory Experiments Measuring Social Preferences Reveal about the Real World? Lab

More information

Commentary on The Erotetic Theory of Attention by Philipp Koralus. Sebastian Watzl

Commentary on The Erotetic Theory of Attention by Philipp Koralus. Sebastian Watzl Commentary on The Erotetic Theory of Attention by Philipp Koralus A. Introduction Sebastian Watzl The study of visual search is one of the experimental paradigms for the study of attention. Visual search

More information

Using Preference Parameter Estimates to Optimise Public Sector Wage Contracts

Using Preference Parameter Estimates to Optimise Public Sector Wage Contracts Working paper S-37120-PAK-1 Using Preference Parameter Estimates to Optimise Public Sector Wage Contracts A Field Study in Pakistan Michael Callen Karrar Hussain Yasir Khan Charles Sprenger August 2015

More information