A Naturalistic Approach to Adversarial Behavior: Modeling the Prisoner's Dilemma


Amy Santamaria
Walter Warwick
Alion Science and Technology
MA&D Operation
4949 Pearl East Circle, Suite 300
Boulder, CO
asantamaria@alionscience.com, wwarwick@alionscience.com

Keywords: Naturalistic Decision Making, Human Performance Modeling, Prisoner's Dilemma

ABSTRACT: In this paper we present a model of the prisoner's dilemma in which human-like, adversarial behaviors emerge from interactions between two naturalistic models. In the prisoner's dilemma, as in other simple two-player games, the behavior that arises from iterated play can be surprisingly complex. We developed a model of this task using an integrated architecture, with a task network representing the decision environment and an underlying naturalistic model of decision making determining branching in the task network. Several adaptive strategies emerged in our two-player models. In addition, we captured several qualitative effects of human performance.

1. Introduction

In our previous work we have focused on demonstrating how simple, naturalistic mechanisms can produce interesting emergent behaviors that match human performance (Warwick, 2006; Warwick & Santamaria, 2006; Warwick & Hutchins, 2004). Our goal in this work was to study the more complex adversarial behaviors that emerge when we play two naturalistic models against each other, with each model making decisions based on its opponent's past behavior. In this paper, we investigate how two such models interact in a common environment and generate meaningful emergent behaviors. In particular, we examined the prisoner's dilemma, a non-zero-sum game studied in game theory and economics, which has been applied to many fields, including psychology, sociology, political science, and philosophy.
The prisoner's dilemma is interesting from the viewpoint of military strategy because it is an example of dynamic, reciprocal, adversarial interaction based on a few simple rules. In fact, the prisoner's dilemma has been used to study arms races and other real-life military and political conflicts (see Lumsden, 1973; Smith, Sola, & Spagnolo, 2000). It is also interesting from a naturalistic viewpoint because actual human behavior often diverges from what is predicted by a rational analysis of the prisoner's dilemma. Here we follow in the footsteps of other cognitive modelers who have demonstrated that general assumptions about the nature of cognition can lead to surprising predictions about human behavior that would be hard to come by given the normative analysis of game theory. For example, Lebiere, Wallach, and West (2000) show how memory can account for many of the qualitative aspects of adversarial behavior in 2x2 game play. That important work sets the context for our own work.

1.1 The Prisoner's Dilemma

The prisoner's dilemma is a two-player game that has been used to examine adversarial behavior. In the classic form of the game, two people are arrested for a crime, but there is not enough evidence to convict without the testimony of at least one of them. Each is offered a deal: testify against the partner, and if the partner stays silent, the one who testifies goes free and the partner gets the maximum sentence. If both prisoners agree to testify, both get a reduced sentence. If both prisoners stay silent, both get a minimum sentence. The two prisoners cannot communicate about their choice. The jointly optimal outcome is for both prisoners to stay silent. However, from the perspective of a single prisoner who does not know what his partner will do, it makes more sense to agree to testify. If the prisoner agrees to testify and his partner stays silent, then he goes free. If his partner also agrees to testify, then he avoids the maximum sentence.

The prisoner's dilemma can also be represented abstractly as a 2x2 game (two players with two choices) with a payoff structure. Two players, who cannot communicate, can choose to either cooperate or defect, and their goal is to maximize their payoff. We used the payoff structure shown in Table 1.1. (The specific numeric values of the payoffs are incidental; it only matters that the matrix preserves certain ratios between payoffs.)

                B Cooperate                 B Defect
A Cooperate     A Payoff=3, B Payoff=3      A Payoff=0, B Payoff=4
A Defect        A Payoff=4, B Payoff=0      A Payoff=1, B Payoff=1

Table 1.1 Payoff structure for the prisoner's dilemma.

While the behavior that emerges from a single iteration of the prisoner's dilemma is trivial, the adversarial behavior that emerges from iterated play can be surprisingly complex and subtle. Given two adaptive players, a type of feedback becomes possible in which each player is simultaneously influencing and influenced by the play of the other. The behaviors that emerge from this so-called reciprocal causation (Lebiere & West, 1999) can be very difficult to unpack because of their dynamic, circular nature: something like a chicken-and-egg problem with a dash of revenge.

2. Model Description

We developed our models using a Micro Saint Sharp task network environment augmented with a naturalistic model of decision making (see Warwick, McIlwaine, Hutton & McDermott, 2001). In this integrated architecture, the task network is used to represent the decision environment, and the naturalistic model uses similarity-based recognition to determine a course of action in the task network, drawing on a store of decision-making episodes that is populated by experience.

Our approach to modeling naturalistic decision making was inspired by Klein's (1998) theory of the recognition-primed decision and implemented by extending Hintzman's multiple-trace memory model (1984; 1986a; 1986b). A decision maker's long-term memory is represented as a set of episodes, each of which represents the situation that prompted a decision (encoded as a cue vector), the course of action taken, and an outcome measure (either successful or unsuccessful). There may be multiple copies of identical experiences in long-term memory. Recognition occurs when a new situation is presented. A similarity value (i.e., a dot product) is computed between the vector representing the new situation and each remembered situation in long-term memory. The resulting similarity value is then raised to a user-defined power and used to determine the proportionate contribution that each remembered episode makes to a composite recollection. The result is then analyzed to produce an output corresponding to a specific course of action, which is implemented, evaluated, and stored as a new episode in long-term memory for use in the next decision.

Our task network model of the prisoner's dilemma is depicted in Figure 2.1. Only the skeleton of the task is visible because the two naturalistic decision making models, one for each of the two players, are implemented as plug-in components to the task network. (The structure of the naturalistic decisions is discussed in more detail in section 3.)

Figure 2.1 Micro Saint task network model of the prisoner's dilemma. The heavy lines branching into Defect and Cooperate indicate where information is sent to a naturalistic decision making model.

3. Method

On each trial, the modeled player must decide between defecting and cooperating. The naturalistic model learns by reinforcement; successful decisions are rewarded while unsuccessful decisions are punished. Thus, it was necessary to define criteria (transparent to the model itself) for a successful decision. Is success beating your opponent? Is it doing as well as your opponent? Is it beating some threshold?
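The recognition-and-reinforcement cycle described in Section 2 can be illustrated with a minimal Python sketch. It is not the Micro Saint Sharp implementation: the class and method names, the +1/-1 cue encoding, the default exponent of 3, the clamping of negative similarities to zero, and the rule of choosing the action with the larger net support are all assumptions made for illustration. Whether a decision counts as successful is supplied from outside, by whichever success criterion is in force.

```python
import random

COOPERATE, DEFECT = "C", "D"


class EpisodicMemory:
    """Hypothetical multiple-trace store of (cues, action, success) episodes."""

    def __init__(self, power=3, seed=0):
        self.power = power              # user-defined exponent on similarity
        self.episodes = []              # identical episodes may be stored repeatedly
        self.rng = random.Random(seed)  # used only to break ties

    def store(self, cues, action, success):
        """Record the outcome of a decision as a new episode."""
        self.episodes.append((tuple(cues), action, bool(success)))

    def support(self, cues):
        """Net similarity-weighted evidence for each action."""
        totals = {COOPERATE: 0.0, DEFECT: 0.0}
        for past, action, success in self.episodes:
            # Normalized dot product between the new and the remembered cues.
            similarity = sum(a * b for a, b in zip(cues, past)) / max(len(cues), 1)
            # Assumption: dissimilar episodes (negative similarity) contribute nothing.
            weight = max(similarity, 0.0) ** self.power
            totals[action] += weight if success else -weight
        return totals

    def decide(self, cues):
        """Pick the better-supported action; choose at random on a tie."""
        totals = self.support(cues)
        if totals[COOPERATE] == totals[DEFECT]:
            return self.rng.choice([COOPERATE, DEFECT])
        return max(totals, key=totals.get)


# Usage: cues encode past plays, assumed here as +1 (cooperate) and -1 (defect).
memory = EpisodicMemory()
memory.store([+1, +1], DEFECT, success=True)
memory.store([+1, +1], COOPERATE, success=False)
print(memory.decide([+1, +1]))
```

Raising the similarity to a power makes close matches dominate the composite recollection, so a higher exponent makes recognition more selective.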
We investigated several different definitions of success: MATCH, BEAT and OPPORTUNISTIC. In MATCH, a player's decision is successful if the payoff is greater than or equal to its opponent's payoff. This rewards mutual cooperation, mutual defection, or defection when the opponent cooperates. In BEAT, a player's decision is successful if the payoff is greater than its opponent's payoff. This never rewards cooperation, and it only rewards defection when the opponent cooperates. In OPPORTUNISTIC, a player's decision to defect is rewarded only if his opponent cooperates, while a decision to cooperate is rewarded only if his opponent does not defect. The first two definitions are straightforward, while the third is a bit more subtle in terms of the behavior it engenders.

In addition to defining the various success criteria for the model, it was also necessary to define cue structures, the information the decision maker uses to prompt recognition. Our default model used a set of six cues, representing its own and its opponent's plays on the past three trials.

3.1 Testing for Emergent Strategies

Our first goal was to observe the emergence of strategies as this default player adapted to its opponent's behavior. We tested the default player against two types of opponents: opponents whose moves were determined purely probabilistically and opponents with varying memory capacities. Each run of the model was 100 trials long, that is, 100 sequential prisoner's dilemma choices.

First, we tested the default Player A against a player whose moves were determined probabilistically. Player B, the probabilistic player, defected with a probability of 0, 0.25, 0.5, 0.75, or 1, depending on the condition, and Player A was rewarded according to MATCH or BEAT definitions of success, for a total of 10 conditions. Each condition was run with 10 different random seeds, for a total of 100 runs (see Table 3.1).

Player B Strategy     BEAT       MATCH
P(defect) = 0         10 runs    10 runs
P(defect) = 0.25      10 runs    10 runs
P(defect) = 0.5       10 runs    10 runs
P(defect) = 0.75      10 runs    10 runs
P(defect) = 1         10 runs    10 runs

Table 3.1 Testing the model against a probabilistic opponent.
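The payoffs of Table 1.1, the three success criteria, and the probabilistic opponent can be written down compactly. The following Python sketch is only an illustration of the definitions given in the text; the names `PAYOFF`, `is_success`, and `probabilistic_move` are hypothetical and do not come from the original models.

```python
import random

# Payoffs from Table 1.1, keyed by (own move, opponent's move);
# each value is (own payoff, opponent's payoff).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 4),
    ("D", "C"): (4, 0),
    ("D", "D"): (1, 1),
}


def is_success(criterion, own, other):
    """Return True if `own` counts as a successful decision against `other`."""
    mine, theirs = PAYOFF[(own, other)]
    if criterion == "MATCH":
        return mine >= theirs
    if criterion == "BEAT":
        return mine > theirs
    if criterion == "OPPORTUNISTIC":
        if own == "D":
            return other == "C"   # defection pays only against a cooperator
        return other != "D"       # cooperation pays only if the opponent does not defect
    raise ValueError(f"unknown criterion: {criterion}")


def probabilistic_move(p_defect, rng):
    """Player B's move when it defects with probability p_defect."""
    return "D" if rng.random() < p_defect else "C"


# The 10 conditions of Table 3.1: five defection probabilities by two criteria.
conditions = [(p, c) for p in (0, 0.25, 0.5, 0.75, 1) for c in ("BEAT", "MATCH")]

rng = random.Random(0)
move_b = probabilistic_move(0.25, rng)
print(move_b, is_success("MATCH", "D", move_b))
```

Written this way, it is easy to see that defecting always satisfies MATCH, while BEAT is satisfied only by defecting against a cooperator.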
Second, we tested the default Player A against players with limited memory (i.e., shorter cue vectors) under different combinations of success criteria. Fixing the memory capacity of the opponent, Player B, to only one previous trial, we tested all combinations of MATCH and BEAT definitions of success for both players, for a total of 4 conditions (MATCH-MATCH, BEAT-BEAT, MATCH-BEAT, and BEAT-MATCH). Each condition was run with 10 different random seeds, for a total of 40 runs (see Table 3.2).

             A: BEAT    A: MATCH
B: BEAT      10 runs    10 runs
B: MATCH     10 runs    10 runs

Table 3.2 Testing the model (Player A) against a limited memory opponent (Player B).

3.2 Evaluating the Model's Choices

Our second goal was to evaluate the model's choices in order to make a comparison to human performance on the prisoner's dilemma task. To do this, we fixed the success criteria to OPPORTUNISTIC and varied the memory of the opponent, Player B, by systematically reducing the number of previous moves used as cues. Each condition was run with 20 different random seeds, for a total of 60 runs (see Table 3.3).

Player B Cues    OPPORTUNISTIC
0,1              20 runs
0,2              20 runs
0,3              20 runs

Table 3.3 Testing the model against a limited memory opponent using OPPORTUNISTIC criteria.

Executable versions of the different integrated models can be inspected and downloaded from:

4. Results and Discussion

4.1 Testing for Emergent Strategies

To understand the results, it is helpful to understand the optimal strategy for both definitions of success. With MATCH as the definition of success, the optimal strategy is to always defect. Defecting means your payoff is either higher than or the same as your opponent's payoff, both of which are considered successful in MATCH. With BEAT as the definition of success, the optimal strategy is to switch between defecting and cooperating. Because in BEAT your payoff has to be strictly larger than your opponent's payoff, defecting only results in success if your opponent cooperates.
To encourage your opponent to cooperate, you must also cooperate some of the time.

When the default Player A was tested against a probabilistic opponent with MATCH as the definition of success (see Figure 4.1), it was able to learn from its opponent's behavior and to choose the appropriate strategy. When the opponent's probability of defecting was zero (it always cooperated), Player A switched between defecting and cooperating, defecting about half the time. When the opponent's probability of defecting was one (it always defected), Player A learned to defect most of the time. In the intermediate condition where the opponent defected 75 percent of the time, Player A tended to defect more often than not; in the other intermediate conditions, Player A switched between cooperation and defection.

Figure 4.1 The default player against a probabilistic opponent using the MATCH definition of success.

When the default Player A was tested against a probabilistic opponent with BEAT as the definition of success (see Figure 4.2), it showed the reverse pattern. When the opponent's probability of defecting was one (it always defected), Player A switched between defecting and cooperating, defecting about half the time. When the opponent's probability of defecting was zero (it always cooperated), Player A learned to defect most of the time. As before, the intermediate conditions led to mixed behaviors.

Figure 4.2 The default player against a probabilistic opponent using the BEAT definition of success.

The default Player A was also able to converge on adaptive strategies when it was matched with a player with reduced memory. When success was defined as MATCH for both players, Player A learned to always defect on all ten runs, while Player B learned this on only four of the ten runs. When success was defined as BEAT for both players, Player A learned to switch between defecting and cooperating on all ten runs, while Player B failed to settle on a strategy. When success was defined as BEAT for Player A and MATCH for Player B, Player A learned to defect about half the time on all ten runs, while Player B failed to settle on a strategy.
When success was defined as MATCH for Player A and BEAT for Player B, Player A chose to defect most of the time, while Player B failed to settle on a strategy.

4.2 Evaluating the Model's Choices

While the emergence of these various strategies demonstrates the naturalistic model's ability to adapt in adversarial contexts, it is still not clear whether this ability to adapt reflects human-like performance. To get at this question, we had to devise a way to score the outcome of each decision. In doing this, we hoped to reconcile our work with previous attempts to model simple games (e.g., Lebiere & West, 1999; West, Lebiere & Bothell, 2006). Toward this end, we fixed the success criteria for both Players A and B to OPPORTUNISTIC, and we defined a cumulative outcome measure such that defecting when the opponent cooperated added 1 point to a player's score, cooperating when the opponent defected subtracted 1 point from a player's score, and mutual cooperation or defection resulted in no change to the score. We then ran the default Player A (with a small change to the cue structure in which only two, rather than three, of the player's own previous moves were remembered) against opponents of varying memory capacity.

Figure 4.3 Cumulative score for the default player against an opponent with different levels of reduced memory, where memory is the number of previous moves (own, opponent's) used to prompt recognition.

We captured two of the main qualitative effects reported by West, Lebiere and Bothell (2006). The results are shown in Figure 4.3. First, the average difference in score between opponents of equal or near-equal memory capacity is near zero, the result of averaging across many individual random walks, where a random walk is a series of successive, randomly determined steps. (We have not yet investigated the distribution of outcomes to see whether the bimodal pattern reported in Lebiere, Wallach & West (2000) is present under all the conditions we tested.) Second, having relatively more memory confers a systematic advantage.

Understanding how this happens is not straightforward. Lebiere, Wallach & West (2000) identify the ability to process a greater number of previous moves and the ability to detect sequential dependencies as the drivers of this behavior. The smarter model can exploit the lesser model's poor ability to predict its behavior. In our case, we believe the advantage emerges because a larger cue set allows more finely tuned pattern recognition, while the reduced-memory model suffers because it conflates past successes and failures. We are currently working to verify this explanation.

Conversely, we have observed cases where a reduced-memory model converges on a successful strategy more quickly than larger-memory models when playing against a model that implements a fixed tit-for-tat strategy. The explanation again lies in the complexity of the pattern matching. More cues mean more complex patterns to learn, which ultimately slows down the learning process.

5. Conclusions

The goal of this work has been to demonstrate how fairly simple mechanisms can engender surprisingly complex emergent behaviors that are adaptive and that reflect some of the qualitative aspects of human game play performance. Compared to previous attempts using cognitive models to account for adversarial behavior, our approach lacks some theoretical overhead (e.g., the activation calculus used in the ACT-R models of game play). This is both a blessing and a curse. We make fewer assumptions about the underlying cognitive process; however, fewer assumptions also lead to leaner foundations for explaining the effects that emerge.

Another unexpected result of this work was the realization that emergent behavior still requires some built-in constraints. Or, to put it another way, cognitive modeling will never be free of a certain amount of knowledge engineering, even for something as seemingly straightforward as simple game play. Whether this means encoding game strategy as procedural knowledge or understanding how to reinforce particular behaviors, there will always be a fair amount of creativity required to make a model go.

6. References

Hintzman, D. L. (1984). MINERVA 2: A simulation model of human memory. Behavior Research Methods, Instruments & Computers, 16.

Hintzman, D. L. (1986a). Judgments of Frequency and Recognition Memory in a Multiple-Trace Memory Model. Eugene, OR: Institute of Cognitive and Decision Sciences.

Hintzman, D. L. (1986b). "Schema Abstraction" in a Multiple-Trace Memory Model. Psychological Review, 93.

Klein, G. (1998). Sources of Power: How People Make Decisions. Cambridge, MA: The MIT Press.

Lebiere, C., Wallach, D., & West, R. L. (2000). A memory-based account of the prisoner's dilemma and other 2x2 games. In Proceedings of the 3rd International Conference on Cognitive Modeling.

Lebiere, C. & West, R. L. (1999). A dynamic ACT-R model of simple games. In Proceedings of the 21st Conference of the Cognitive Science Society.

Lumsden, M. (1973). The Cyprus conflict as a prisoner's dilemma game. Journal of Conflict Resolution, 17.

Smith, R., Sola, M. & Spagnolo, F. (2000). The prisoner's dilemma and regime-switching in the Greek-Turkish arms race. Journal of Peace Research, 37.

Warwick, W. (2006). A bad Hempel day: The decoupling of explanation and prediction in computational cognitive modeling. In Proceedings of the Fall 2006 Simulation Interoperability Workshop.

Warwick, W. & Hutchins, S. (2004). Initial comparisons between a "naturalistic" model of decision making and human performance data. In Proceedings of the 13th Conference on Behavior Representation in Modeling and Simulation.

Warwick, W., McIlwaine, S., Hutton, R. J. B., & McDermott, P. (2001). Developing computational models of recognition-primed decision making. In Proceedings of the 10th Conference on Computer Generated Forces.

Warwick, W. & Santamaria, A. (2006). Giving up vindication in favor of application: Developing cognitively-inspired widgets for human performance modeling tools. In Proceedings of the 7th International Conference on Cognitive Modeling.

West, R. L., Lebiere, C., & Bothell, D. J. (2006). Cognitive architectures, game playing, and human evolution. In R. Sun (Ed.), Cognition and Multi-Agent Interaction: From Cognitive Modeling to Social Simulation. New York: Cambridge University Press.

Author Biographies

AMY SANTAMARIA is a Research Cognitive Scientist at Alion Science and Technology. Her research focuses on modeling human behavior and cognition. She received her Ph.D. in Cognitive Psychology and Neuroscience and an M.A. in Cognitive Psychology from the University of Colorado Boulder.

WALTER WARWICK is a Senior Systems Analyst at Alion Science and Technology. He is working on several projects having to do with the modeling and simulation of human behavior. He received his Ph.D. in History and Philosophy of Science, an Area Certificate in Pure and Applied Logic, and an M.S. in Computer Science from Indiana University.
