Inference in Dynamic Environments Mark Steyvers & Scott Brown, UC Irvine


1 Introduction

1.1 Background

Understanding how people make decisions, from very simple perceptual decisions to complex cognitive ones, is an important area of research in psychology. A great deal of effort has gone into finding empirical regularities of decision making behavior, and into developing models of the underlying decision processes. A remarkable commonality across almost all previous research into decision making is the assumption of stationarity. With few exceptions (e.g., Kac, 1962; Rabbitt, 1981; Treisman & Williams, 1984; Vickers & Lee, 1988, 2000), models of decision making assume that successive decisions are independent. This assumption is almost certainly false, but it has proven useful in keeping models of decision making simple and tractable. Assuming stationarity also seems quite reasonable given that almost every decision making experiment has employed a stationary decision making environment. Of course, real decision making environments are dynamic rather than stationary. Consider the case of a military observer making decisions about the identity (friend vs. enemy) of noisy stimuli from reconnaissance pictures. The difficulty of these decisions will change throughout the task, as more or less clear pictures are used, or more or less uniform terrain is observed. An ideal observer must dynamically adjust their decision making process to reflect changes in the environment. For example, if it becomes easier to identify friendly stimuli in new terrain, observers should relax their criterion for identifying enemy stimuli. Some prior research has addressed this classic criterion setting problem. Treisman and Williams (1984) developed a dynamic variant of signal detection theory (SDT) in which the decision criterion changed from decision to decision, based on the previous stimuli and responses.
Vickers and Lee (1988, 2000) developed a similar idea for a much more complex decision making model. Our paradigm departs from previous work in some important ways, illustrated in Table 1. In terms of dynamics, the simplest kind of study is one in which the experimental design is static. A static (or stationary) design is one in which the properties of the task do not change during the experiment, hence participants do not need to change their behavior during the experiment in order to remain optimal. The critical feature of a static experiment is that the experimental conditions are drawn from a stationary distribution throughout the experiment, at least from the point of view of the participant. The upper left cell in Table 1 represents static experiments in which the analyses are also static (i.e., no consideration of sequential effects, or of the effects of task history). Experiments with between-subjects designs are typical of this category: the task for a given participant does not change during the experiment, and there is no compelling reason to consider sequential effects. Static experiments are limited in their design, so researchers typically use dynamic experiments, meaning that experimental conditions change with time, forcing participants to adjust their decision making processes in order to remain optimal. The lower left cell of Table 1 represents dynamic experiments with static analyses: i.e., analyses that neglect the effect of task history and sequential dependencies.

                     Static Analyses                  Dynamic Analyses
                     (no sequential effects)

Static Experiment    Between-subjects designs;        Chaos, nonlinear dynamics,
                     early SDT experiments.           prior research into
                                                      criterion setting.

Dynamic Experiment   Within-subjects blocked          Our research proposal.
                     designs, including much of
                     cognitive psychology research.

Table 1: Types of dynamic research

Research of this kind is very common in psychology: blocked designs are used, making the task dynamic, but static analyses are applied, because researchers (often implicitly) assume that sequential dependencies between blocks are either unimportant or unmeasurable. This kind of research includes some of our own work (e.g., Heathcote, Brown & Mewhort, 2000; Shiffrin & Steyvers, 1997). Some researchers acknowledge that the effects of task history may be important, but still decide to use static analyses, and control for dynamic effects by censoring some data (e.g., from the start of each block). Static experiments (those with unchanging tasks) can also be subjected to dynamic analyses (top right cell of Table 1). These cases have so far represented almost all dynamic research, including most research into the presence of chaos or nonlinear dynamics in behavioral data (e.g., Kelly, Heath & Longstaff, 2001; Van Orden, Moreno & Holden, 2003), and previous seminal examinations of the criterion setting problem (e.g., Kac, 1962; Rabbitt, 1981; Treisman & Williams, 1984). For example, in Treisman and Williams' research, the decision criterion was allowed to vary based on previous responses, but these variations were essentially spontaneous and unnecessary: an ideal observer in Treisman and Williams' experiments would not have varied their decision making criterion at all. This is in contrast to many real-world decision making situations, where optimal behavior necessitates adjusting to ("tracking") the changing environment.

1.2 Proposed direction of research

Applying dynamic analyses to dynamic experimental designs (bottom right cell of Table 1) is the new direction we investigate in this proposal. The extra complexity introduced by having both dynamic models and dynamic experiments means that we are forced to keep both the experimental design and the modeling as simple as possible, to maintain tractability.
We propose here new research that avoids both the assumption of stationary decision processes and that of stationary decision environments. We develop an experimental paradigm in which dynamic decision making environments force participants to change their decision making processes in order to remain (approximately) ideal. This paradigm allows us to observe decision makers tracking changes in the environment. We also propose the continued development of two models of the decision process in dynamic environments. One model is an ideal observer system in which statistical evidence for a changed environment is weighed in optimal fashion against evidence for a stable environment. The ideal observer analysis yields estimates of the (optimal) number of trials it takes to detect and adjust to new decision environments. Our other model is a dynamic SDT model that estimates how long it actually takes individual decision makers to adapt to novel decision environments. By comparing predictions from the ideal observer model with the parameter estimates from the decision model (for individual decision makers), we can quantify the degree of mismatch between ideal and actual observers. Taking the reconnaissance picture example above, a good observer would accurately track changes in the difficulty of the pictures, so it seems logical that the best person to select for this task is one who can adjust their decision processes as quickly as the ideal observer. It is also of interest to measure the direction of mismatch between ideal and actual observer. For example, some observers might change their decision processes too early or too late (compared to the ideal observer) after a change in the decision environment. A change in either direction will lead to suboptimal behavior. In tasks where there is a high penalty for falsely detecting a change, selection of good observers might favor late switchers.
Similarly, if the task requires quick change detection, with a severe penalty for detecting a change too late, selection of good observers might favor early switchers.

1.3 Overview of proposal

In Section 2, we develop a new paradigm for studying dynamic decision making; this paradigm forms the basis for much of the research outlined in this proposal. In Section 3, we sketch a theory of change detection, based on the principles of ideal observer models. Section 4 gives an overview of our experiments and presents data from four completed experiments (over 400 participants) using the new paradigm, investigating individuals' abilities to track dynamic decision making environments. Section 4.2 develops a simple dynamic version of signal detection theory that we use as a measurement model, and shows its application to existing data. Section 4.3 develops our ideal observer model of change detection in greater detail, and illustrates its application by

comparison with data from our preliminary experiments.

Figure 1: Basic paradigm. Decision environments change (top), leading to changes in the ideal observer's decision processes (middle). Participants' actual processes (bottom) lag behind the ideal observer model.

Section 5 outlines a series of new experiments we propose to extend our research. These experiments are intended to have both basic scientific value (as investigations of decision making processes) and applied value (as new measurement tools).

2. Decision-making paradigm

The decision paradigm we use is based on a simple two-alternative forced-choice design. In principle, any kind of decision stimuli could be used; for the first illustration we will continue with the reconnaissance picture example, but we describe many other examples below. To illustrate the paradigm, suppose there is a series of reconnaissance pictures, with a single target in each. The observer's task is to classify each target as an enemy stimulus or a friendly stimulus, based on physical characteristics (shape, color, markings, etc.). Two different decision environments can be defined by changing the properties of the enemy stimuli. In one environment, the enemy stimuli may look quite dissimilar from the friendly stimuli, making decisions relatively easy. In the other, enemy and friendly stimuli may be much more alike, resulting in relatively hard decisions. We can then construct a dynamic decision environment by alternating sequences of easy and hard decisions, as shown in the top panel of Figure 1.
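In signal detection terms, the easy and hard environments just described correspond to different overlaps between the friendly and enemy distributions, and hence to different ideal criteria. The sketch below illustrates this under assumed equal-variance Gaussian distributions with equal priors; all numerical values are our own illustrative choices, not parameters from the proposal.

```python
from statistics import NormalDist

# Illustrative assumption: friendly stimuli ~ N(0, 1) throughout; enemy
# stimuli ~ N(3, 1) in the easy environment and N(1, 1) in the hard one.
FRIENDLY_MEAN, SD = 0.0, 1.0

def ideal_criterion(enemy_mean):
    # With equal priors and equal variances, the likelihood-ratio criterion
    # falls midway between the two distribution means.
    return (FRIENDLY_MEAN + enemy_mean) / 2.0

def rates(criterion, enemy_mean):
    # Hit = enemy correctly called "enemy"; false alarm = friendly called "enemy".
    hit = 1.0 - NormalDist(enemy_mean, SD).cdf(criterion)
    fa = 1.0 - NormalDist(FRIENDLY_MEAN, SD).cdf(criterion)
    return hit, fa

for label, mu in [("easy", 3.0), ("hard", 1.0)]:
    c = ideal_criterion(mu)
    hit, fa = rates(c, mu)
    print(f"{label}: criterion = {c:.2f}, hit = {hit:.3f}, false alarm = {fa:.3f}")
```

With these made-up numbers the ideal criterion moves toward the friendly distribution in the hard environment (from 1.50 to 0.50), so the false alarm rate on the unchanged friendly stimuli rises from about .07 to about .31: exactly the kind of signature on the fixed stimulus class that the paradigm exploits.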
An ideal observer would have to raise the bar for what is considered an enemy stimulus in the hard condition, where enemy stimuli look very similar to friendly stimuli, and lower the bar for what is considered an enemy stimulus in the easy decision environment. As illustrated in the middle row of Figure 1, even our ideal observer model cannot immediately tell that a change in the environment has occurred. The ideal observer model must instead observe several pictures before it gathers enough statistical evidence to decide that the environment has changed, and that it must therefore change its decision making processes. As illustrated in the bottom part of Figure 1, an actual observer might deviate from the ideal observer by taking more time to adapt to the new environment. The comparison between actual and ideal observers yields a theoretically interesting and practically relevant measure: the lag between the switch point as estimated by the ideal observer and by the actual observer. An important property of our paradigm is that one stimulus class (the friendly stimuli, in this case) does not change during the experiment. Thus, any changes observed in responses to those stimuli must be caused by changes in the decision processes of the observer. Continuing our reconnaissance picture example, if an observer raises their criterion in response to finding more difficult enemy stimuli, this should result in an observable increase in the number of friendly stimuli incorrectly identified as enemies.

3. Ideal Observers in Change Detection Tasks

The paradigm outlined above is realistic in that participants are required to make a decision about every stimulus. They are thus forced to respond to changes in the environment only as a consequence of this task: to maintain optimality in decision making, they must adjust to changes in the decision environment. This contrasts with other

statistical research investigating change detection. Such research usually begins with the aim of detecting a change point in a complete sequence of data (e.g., by estimating a hidden Markov model and applying the Viterbi algorithm). In our paradigm, participants must instead detect the change as the sequence unfolds.

Figure 2. Illustration of the switch point detection task and ideal observer models. Four panels plot signal strength against time; the actual switch point is at trial 200. The early switcher has a hindsight error of -60 trials, the ideal switcher +10 trials, and the late switcher +60 trials. With all data available, the best hindsight switch point is trial 205.

We envisage the ideal observer model for our paradigm as follows. For simplicity, imagine that an observer is presented with only a sequence of numbers, perhaps representing the strength of sonar echo signals. They are told that, at exactly one point during the sequence, the distribution of these numbers will change, perhaps signaling the presence of a sonar target. Observers could operate by slowly gathering evidence about the distribution of the signals, and responding when they think this distribution has changed. This decision task is illustrated in Figure 2. The sequence of black circles represents a sequence of samples from the distributions, shown to the observer one at a time. Note that the switch point is halfway through the series, at trial 200: the data in the first 200 trials are sampled from N(0,1), and the data in the last 200 trials are sampled from N(1.25,1). The observer's task is to identify when the change has occurred, as the sequence unfolds. That is, after each data point is presented, the observer is asked "Has the switch point passed yet?", and they are allowed to answer "yes" only once. An ideal observer model for this task may proceed using a nested model comparison framework.
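A minimal sketch of such a nested model comparison, run online as each observation arrives, is given below. It assumes, as in the sonar example, known unit-variance normal distributions before and after the change; the alpha = .05 criterion is our illustrative choice, not a value fixed by the proposal.

```python
def online_change_detector(stream, mu0=0.0, mu1=1.25, crit=3.841):
    """Answer "yes, the switch has passed" the first time the best
    switch-point model beats the no-switch model by a 2 * log-likelihood
    difference exceeding `crit`. Here crit = 3.841 is the chi-square(df=1)
    cutoff for alpha = .05 -- an assumed setting for illustration."""
    best = 0.0  # max over switch points k of logL(two-state, k) - logL(one-state)
    for t, x in enumerate(stream, start=1):
        # Per-trial log-likelihood ratio of N(mu1,1) over N(mu0,1). The
        # two-state minus one-state log-likelihood is the sum of these over
        # the post-switch suffix, so the maximum over candidate switch
        # points obeys a running (CUSUM-style) recursion.
        s = 0.5 * ((x - mu0) ** 2 - (x - mu1) ** 2)
        best = s + max(best, 0.0)
        if 2.0 * best > crit:
            return t  # trial at which the observer first answers "yes"
    return None  # never detected a change

# Noise-free illustration: 200 trials at the pre-change mean, then 200 at
# the post-change mean; detection occurs a few trials after the switch.
print(online_change_detector([0.0] * 200 + [1.25] * 200))  # -> 203
```

Replacing the noise-free stream with actual N(0,1) and N(1.25,1) samples gives the stochastic detection behavior discussed below, where the detection trial varies from sequence to sequence.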
For the moment, suppose that the ideal observer knows the theoretical distribution of the non-signal data (occurring in the early part of the sequence) and of the signal data (arriving after the switch point), but doesn't know the location of the switch point. After each data point arrives, the ideal observer could evaluate the evidence for two nested models. The first model is that the data so far have been generated by just one distribution (the non-signal distribution). The likelihood of the data under this hypothesis is easy to compute. The second hypothesis is that the data so far were generated from the non-signal distribution before some hypothetical switch point, and from the signal distribution after that point. The likelihood of this hypothesis could be calculated by evaluating the likelihood separately for each possible switch point (from the beginning of the sequence to the current time) and choosing the most likely switch point. The ideal observer would then respond that the switch had occurred if the likelihood of the switch model exceeded the likelihood of the one-state model by some criterion amount. Under standard assumptions, twice the difference in log-likelihood values will be distributed as a χ² variable, so the criterion difference required to detect a change would be determined by the ideal observer's desired Type I error rate. An ideal observer with a very strict Type I error rate (e.g., p=.0001) will only decide that a change has occurred when the two-state model has very much greater likelihood than the one-state model; an example is the late switcher in Figure 2. Conversely, an ideal observer with a very lax Type I error rate (e.g., p=.2) will decide that a change has occurred when the two-state model fits only a little better than the one-state model (e.g., the early switcher in Figure 2). Adjustments in the Type I error rate implement a trade-off between the confidence that a change actually has occurred when a response is made, and the lateness of that response.

The trade-off between confidence and lateness can be resolved by appealing to the concept of hindsight. At the end of the stimulus sequence, when all the data are available, the ideal observer can make a best hindsight estimate of the location of the switch point. This estimate is more accurate than any made during the sequence (see the vertical line marked "best hindsight switch point" in Figure 2). The ideal observer can then evaluate their own performance with the benefit of hindsight: the actual response of the observer (i.e., the switch point indicated during the unfolding of the sequence) can be compared with the best hindsight estimate, and we will refer to the difference as hindsight error. The ideal level of evidence (i.e., Type I error rate) for detecting switch points can then be calculated by identifying the level that minimizes the discrepancy between the actual response and the best hindsight value, over repeated sequences. For the particular example in Figure 2, the location of the switch point detected using this ideal level of evidence is illustrated as the ideal switcher. We consider this ideal observer model for switch detection in greater detail below. We quantify the model, and compare the performance of ideal observers with the performance of our participants in preliminary experiments. The correspondence between our preliminary experiments and the ideal observer analysis is not simple, since our experiments require more than just change detection from the participants. We therefore propose some additional experiments that provide a more straightforward correspondence with the ideal observer model.
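The hindsight calibration described above can be sketched in simulation: for each candidate Type I error rate, compare the online response with the best hindsight switch point over many simulated sequences, and prefer the rate that minimizes mean absolute hindsight error. All numerical settings below (the distributions, the alpha grid, 200 sequences per level) are illustrative assumptions.

```python
import random

MU0, MU1 = 0.0, 1.25  # assumed known pre- and post-change distributions

def scores(xs):
    # Per-trial log-likelihood ratio of N(MU1,1) over N(MU0,1).
    return [0.5 * ((x - MU0) ** 2 - (x - MU1) ** 2) for x in xs]

def hindsight_switch(xs):
    """Best switch-point estimate with all data in hand: the index k that
    maximizes the summed evidence of the suffix xs[k:]."""
    s = scores(xs)
    suffix, best_val, best_k = 0.0, 0.0, len(xs)
    for k in range(len(xs) - 1, -1, -1):
        suffix += s[k]
        if suffix > best_val:
            best_val, best_k = suffix, k
    return best_k

def online_response(xs, crit):
    # Same running nested-model comparison as the detector sketched earlier.
    best = 0.0
    for t, s in enumerate(scores(xs), start=1):
        best = s + max(best, 0.0)
        if 2.0 * best > crit:
            return t
    return len(xs)  # never answered "yes"; score the response at sequence end

# chi-square(df=1) critical values for three illustrative Type I error rates
ALPHAS = [(.2, 1.642), (.05, 3.841), (.0001, 15.137)]

random.seed(0)
for alpha, crit in ALPHAS:
    errors = []
    for _ in range(200):  # 200 simulated sequences per alpha level
        xs = ([random.gauss(MU0, 1) for _ in range(200)] +
              [random.gauss(MU1, 1) for _ in range(200)])
        errors.append(abs(online_response(xs, crit) - hindsight_switch(xs)))
    print(f"alpha={alpha}: mean |hindsight error| = {sum(errors)/len(errors):.1f} trials")
```

A lax rate tends to produce early (sometimes pre-change) responses far from the hindsight estimate, a strict rate produces late ones, and an intermediate rate minimizes the mean discrepancy, which is the sense in which it is "ideal" here.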
4. Overview of Experiments

4.1 Decision making situations

So far, we have used just one decision making situation in our examples: classifying parts of pictures as friendly or enemy. Of course, this is not the only decision making task that is suitable for our work. We propose experiments using a variety of decision types, with the aim of generalizing our results across different levels of decision making, from low-level sensory decisions to much higher-level processes. The experiments below include the following decision making situations:

- Lexical decision, i.e., deciding whether a character string is a valid English word or not. We manipulate decision difficulty by changing the 'wordiness' of non-words (e.g., XXQZ is a less word-like non-word than STIP).

- Numerosity judgment, i.e., deciding which of two symbols is more common in a given display. Our version of this classic task is designed to represent some aspects of real-world tasks, such as reading warning gauges (e.g., does the needle of a particularly noisy gauge point more to the left or the right?). We manipulate difficulty by changing how close to equal the proportions of the two symbols are.

- Recognition memory for real-world pictures and/or faces. Participants decide whether pictures they are shown match ones they were previously instructed to remember. We manipulate difficulty by using distractor pictures that are more or less similar to those in memory. We plan to use both pictures of faces and pictures of outdoor scenery.

- Threat detection. Given only partial information about a possible assailant, participants must decide whether to respond with deadly or more moderate force. We manipulate decision difficulty by making the partial information more or less ambiguous.

4.2 Preliminary Experiments

We have collected data from over 400 subjects in a preliminary series of three experiments.
We aimed to study the way in which participants changed their decision criteria in response to environmental changes, at the trial-by-trial level. Before we began our experiments, it was not clear whether we would be able to observe these changes experimentally, as they could conceivably occur within one or two trials of the stimulus properties changing. Our preliminary research therefore aimed to collect sufficient data to quantify the nature of behavioral changes in response to stimulus changes, by employing a decision environment as illustrated in Figure 3. The left and center panels of Figure 3 show how the stimulus environment changed: the expected value of the signal stimuli (top line in the top left plot) was kept constant, while the properties of the noise stimuli (lower line, left plot) were altered at random points.

Figure 3. Conceptual design for Experiments #1-#3. Panels show the experimental sequence, the predicted criterion, and the predicted hit/false alarm rates. The properties of the signal stimuli (triangles) are kept constant; the properties of the noise stimuli (circles) are varied.

The left hand panel of Figure 3 shows the design of the experiment, with alternating noise stimulus properties and fixed signal properties. The middle panel shows the optimal decision criterion, and how the observer's decision criterion lags behind it somewhat. This lag is a direct consequence of the difficulty of detecting the changes in stimulus properties. The right panel shows how false alarm and hit rates change under this model. The false alarm rates change both when the properties of the noise stimuli change, and also on a slower time scale, as the observer adjusts their criterion. The hit rates change as well, and these changes can only be due to changes in decision criteria, as the signal properties are fixed throughout the experiment. Below, when we graph data and model fits from our experiments, we graph hit and false alarm rates as in the right panel of Figure 3. We average across stimulus context changes, so we show just one hard-to-easy context change and one easy-to-hard change. These changes occurred either once only within each block (Experiment #1) or else between blocks (Experiments #2 and #3). Ideally, the decision task employed for these experiments would have stimulus changes that are not self-evident: if one were to present all red stimuli, then switch to all green stimuli, a participant's tracking of that change would be trivial. The decision tasks used in our experiments all fit the conceptual design in Figure 3. In Experiments #1 and #2 we used lexical decision, where the subject decides whether a letter string forms an English word (signal) or a nonsense word (noise).
By varying the 'wordiness' of the nonwords, we vary the signal strength of the noise distribution.

4.2.1 Rhythmic Signal to Respond Procedure

The LDT task produces bivariate response data: response latency (RT) and accuracy for each decision. When participants are forced to make decisions under great time pressure, without sufficient time to accurately weigh their options, variables such as speed-accuracy tradeoff (SAT) settings would be affected (McElree & Dosher, 1989). In order to accurately track dynamic changes in the amount of time available for decision making, participants would have to constantly adjust their SAT settings. Such adjustments could be modeled using decision boundaries in sequential sampling choice RT models (e.g., Heathcote & Brown, 2001; Mozer, Colagrosso & Huber, 2002; Ratcliff, 1978; Usher & McClelland, 2001). However, no generally accepted method for dealing with the SAT exists, because the definition of a suitable decision model is an unresolved question, and corrections for SAT are model dependent. To avoid such problems at a methodological rather than an analytic level, we employed a new variant of the signal to respond procedure. This procedure restricts variability in response latency as much as possible, so that observed changes in accuracy are difficult to attribute to a SAT. The canonical signal to respond procedure presents participants with a stimulus and then, a certain amount of time later, presents them with a response signal. Subjects are required to make their response within a certain window of time following the response signal.

This procedure is often difficult for participants, resulting in much wasted data and occasionally ineffective control over the SAT. Our experiments employed a variant of the signal to respond procedure, similar to the tempo naming procedure used by Kello and Plaut (2000, 2003; see also Steyvers, Wagenmakers, Shiffrin, Zeelenberg & Raaijmakers, 2001). This procedure sought to make responding within the acceptable window easier for participants by employing a rhythmic auditory cue. Our procedure kept the rhythmic cue present and consistent throughout entire blocks. Specifically, a 256Hz tone sounded for 50msec every 400msec throughout Experiments #1-#3 (except during between-block breaks). This rhythm anchored the temporal structure of the experiment for the participant. Stimuli were always presented on every (say) 10th tone, and a response was always required between the two tones following the stimulus. Analysis of the RT data from this procedure confirmed that it helped subjects to maintain an accurate idea of when responses were required, and to keep their response latencies tightly controlled.

4.2.2 Experiment #1 Lexical Decision with Random Switch Points

Using our rhythmic procedure, the first experiment examined the effect of changing stimulus properties without informing participants. We used a standard lexical decision task (LDT) and kept the properties of the words constant across the whole experiment, while varying the wordiness of the non-words. We created non-words that were difficult to distinguish from words by changing just one letter in seven-letter words, for example: SUBVIRT, LIBFARY and PETWIFY; these non-words defined the difficult decision environment. The easy decision environment was defined by non-words differing by three letters from seven-letter words, for example: FOMLERS, EPPAASI, and LERTINE. In Experiment #1 we changed the properties of the non-words midway through blocks, at random points.
This is not typical of psychological experiments, but we expected that in this design subjects would change their criteria relatively slowly, allowing us to better study those changes. Participants were not told that the decision environment would change during the experiment; they were given only the usual instructions to respond as accurately as possible while keeping their response times within the acceptable response signal window. Our data from Experiment #1 (n=135) suggested that changing stimuli led to relatively long transients in hit and false alarm rates. The top left panel of Figure 4 graphs the data from Experiment #1. The vertical dashed line represents the point at which the decision context changed, from either difficult to easy (open symbols) or easy to difficult (filled symbols). The units on the abscissa are trials within each block, with zero representing the point in the block where the properties of the non-words changed. Note that this point was randomly distributed across blocks; the data are aligned to this point for graphing only. It is clear that the change in the properties of the non-words produced large and rapid changes in the data from non-word stimuli, as expected. More interestingly, the data from word stimuli show changes even though these stimuli were unchanged throughout the experiment. These changes are shown in detail in the lower left panel of Figure 4, which graphs the mean difference in hit rates between the easy and hard decision environments. Naively, one would expect a higher hit rate in the easy environment, and hence a positive difference in this plot. Notice, however, that after the environment changes the difference remains negative for around 10 trials. This blocking effect is well known from lexical decision tasks (e.g., Kinoshita & Lupker, 2002; Ratcliff, Van Zandt & McKoon, 1999; Stone & Van Orden, 1993; Wagenmakers et al., in press), but the trial-by-trial dynamics have not previously been investigated.
Interestingly, the change in the data from the word stimuli is gradual; on average it took 14 trials for the crossover to occur. A gradual change strongly suggests that it is the long-term accumulation of task information that is driving this change. Below, we develop a simple measurement model to describe these data, based on a dynamic version of signal detection theory (SDT).

Figure 4: Data from Experiments #1-#4 (one column of panels per experiment). The top row plots P("target") against trials (before and after the switch for Experiment #1; within blocks for Experiments #2-#4): circles and triangles represent hit and false alarm rates respectively, and filled and open symbols correspond to hard and easy decision environments respectively. The bottom row of plots shows the difference between hit rates in the easy and hard conditions, HR(easy) - HR(difficult), and how this changes during a block.

The model fits shown in the top panels of Figure 4 are from the dynamic variant of SDT outlined below. Note that the model captures the broad qualitative patterns in the data, such as the crossovers in both hit and false alarm rates, as well as the steady changes in false alarm rates after the stimulus switch point.

4.2.3 Experiment #2 Lexical Decision with Blocked Stimulus Conditions

Experiment #1 was interesting because it demonstrated that stimulus history has a strong effect on responses. However, using random switch points within blocks reduces the realism of our design, as most experiments contain environment changes only between blocks. Researchers who wish to pool their data across blocks may well claim that the break between blocks operates to minimize the effect of stimulus history; that is, the block break may cause subjects to somewhat reset their ideas about the structure of the stimuli. Experiment #2 tested this theory by replicating Experiment #1, except with the structure of the non-word stimuli changing precisely between blocks; that is, blocks always contained homogeneous stimuli, as is the usual practice. Additionally, since we no longer needed blocks long enough to place stimulus switch points within them at random, we shortened the blocks from 100 trials to 40, and increased their number from 10 to 20. Once again, participants were not told that the experiment consisted of two different kinds of decision environment.
The data from Experiment #2 (n=106) suggest that slow transients similar to those observed in Experiment #1 appear even when changes in stimulus properties are synchronized with block breaks. The hit rates (top middle panel of Figure 4) for the easy and hard conditions look very similar at the start of the block, suggesting some criterion resetting during the block break. However, the hit rate for the difficult condition is statistically significantly higher than that for the easy condition, which suggests that the criterion at the start of a block is at least partially carried over from the end of the previous block. This conclusion is supported by the graph of hit rate differences for Experiment #2, in the lower middle panel of Figure 4. Those data show that it takes an average of 10 trials (one quarter of each block) for the hit rates in the easy and hard decision contexts to cross over to their asymptotic values, and that there are clear mirror effects whereby the hit rates change smoothly from easy to hard levels even though those conditions are blocked.

4.2.4 Experiment #3 Lexical Decision With Prior Warning

The results of Experiments #1 and #2 suggested that people take many trials to adjust their behavior after changes in the decision environment. Experiment #3 investigated whether the speed of this change can be influenced by conscious control. Experiment #3 used exactly the same design as Experiment #2, but with different instructions. Participants were explicitly told that there would be difficult and easier blocks of trials, and that these would alternate. They were also shown examples of easy and difficult non-words. To ensure that they were aware of the current decision environment, throughout difficult blocks the words HARD BLOCK were displayed inside a large red patch on screen. During easy blocks, the words EASY BLOCK were displayed in a large green patch. During each block break, participants were reminded what type of block was coming next. The data from Experiment #3 are remarkable for their similarity to those from Experiment #2. Even with extensive knowledge of the experimental design, and forewarning, participants did not adjust to new decision environments noticeably more quickly than in Experiment #2. There was still a clear crossover in hit rates during the early trials of each block (see the bottom panel of the Experiment #3 plots). This indicates that participants' behavior during the early part of each block was strongly influenced by the previous block. The mean estimate of the time taken to adjust to the new decision environment (the lag parameter from our simple dynamic SDT model, below) was 14 trials. This indicates that participants took, on average, more than one third of each block to overcome the effects of task history from the previous block.

4.2.5 Experiment #4 Numerosity

Experiments #1-#3 all employed a lexical decision paradigm, leaving open the possibility that our results would not generalize to other decision tasks.
Experiment #4 extends these results to a lower-level task: numerosity judgment. Classic numerosity judgment tasks (e.g., Palmeri, 1997) require decision makers to view patterns of symbols, typically dots, and decide how many symbols are in each. We used a categorization version of numerosity judgment, in which participants always saw displays containing exactly 10 symbols. The symbols were always of two types, and the participant's task was to decide which of the two symbols was in the majority. Our particular version of this task was constructed to emulate the process of reading a noisy gauge, or a vibrating indicator needle. The two types of symbols we used were left- and right-pointing arrows, presented in a row, as in Figure 5.

Figure 5: Example numerosity stimulus (<<<><>><<>). The correct answer is "left", as more symbols point left than right in this display.

We manipulated decision difficulty by changing the distribution of proportions used: a display with 4/6 left/right elements is much more difficult than one with 2/8. Right-favoring displays always used 4/6 or 3/7 proportions, while left-favoring displays varied from easy (9/1 or 8/2) to hard (7/3 or 6/4). To avoid biases, we reversed the left/right assignment for half of the participants. As in Experiment #2, switches between easy and hard decision environments occurred only between blocks.

The data from Experiment #4 are graphed in the right-hand panels of Figure 4. Even though the numerosity task employed in Experiment #4 was very different from the lexical decision task used in Experiment #2, the data are very similar. A hit rate difference between easy and hard decision environments builds up slowly throughout each block, and blocks begin with a negative difference (bottom right panel of Figure 4), indicating that participants carried over their decision making processes between blocks: a sub-optimal strategy.
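The stimulus scheme just described can be illustrated with a minimal generator. This sketch assumes the unreversed left/right assignment and an even split between left- and right-favoring trials (the split is our assumption, not stated above); function and parameter names are our own:

```python
import random

def numerosity_trial(environment, rng=random):
    """One 10-arrow display for the numerosity task.

    Right-favoring displays always use 4/6 or 3/7 left/right proportions;
    left-favoring displays are easy (9/1, 8/2) or hard (7/3, 6/4)
    depending on the current decision environment."""
    if rng.random() < 0.5:              # assumed: half the trials favor right
        n_left = rng.choice([4, 3])     # 4/6 or 3/7
    elif environment == 'easy':
        n_left = rng.choice([9, 8])     # 9/1 or 8/2
    else:
        n_left = rng.choice([7, 6])     # 7/3 or 6/4
    arrows = ['<'] * n_left + ['>'] * (10 - n_left)
    rng.shuffle(arrows)                 # random spatial arrangement
    return ''.join(arrows), ('left' if n_left > 5 else 'right')
```

Blocking the difficulty manipulation then amounts to calling the generator with 'easy' or 'hard' according to the current block.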
4.3 Simple Dynamic Measurement Model

We have developed a simple dynamic version of signal detection theory (SDT) as a way to approximate our ideal observer model, with as few assumptions about the nature of the data as possible. Given a particular set of parameters, the ideal observer model discussed above leads to a distribution of switch detection points over repeated presentations of the same sequence. Our simple dynamic SDT model approximates this distribution with a single switch point. That is, we estimate a single lag for each participant, estimating how long they take (on average) to notice that an environment change has occurred. The simplicity of this model is especially important for its purpose as a measurement tool, as simplicity helps to decrease uncertainty in parameter estimates. This is crucial because we wish to interpret parameter estimates as meaningful measures of cognitive and behavioral phenomena at the level of individual decision makers.

Figure 6: Predictions from our dynamic SDT model. Left panel shows static SDT submodels. Right panel shows predicted criterion changes (top) and predicted hit rate and false alarm rate changes (bottom).

In its simplest form, our dynamic SDT model is designed to apply to decision-making tasks in which two different decision-making environments alternate throughout the task. The model is based on two static SDT models, one for each decision-making environment. For ease of explication, we assume that one decision-making environment is more difficult than the other. In that case, our model assumes that there is an SDT model operating in the difficult environment, defined by a sensitivity parameter (d'_H, "H" for hard) and a decision criterion (C_H), and another SDT model operating in the easy decision environment, defined by d'_E and C_E ("E" for easy). The crucial addition that allows us to model dynamic behavior is that we assume a lag when changing from one decision environment to the other. For example, when the decision environment changes from easy to difficult, we assume that the sensitivity of decisions changes immediately, from d'_E to d'_H. Such immediacy makes sense given that only the stimuli themselves define decision difficulty. By contrast, the decision criterion is under the control of the decision-maker, and thus will not change until they notice the change in decision environment, or some correlated variable (e.g., changed error rates). In our example, when changing from an easy to a hard decision environment, we assume that the decision criterion only changes from C_E to C_H after some lag, L.
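The model just described is simple enough to sketch directly. The sketch below adopts the parameterization implied by our figures, in which the signal distribution stays fixed and only the noise distribution moves when difficulty changes (so d' changes through the noise mean), with unit-variance distributions and alternating equal-length blocks beginning with an easy block; the function names are ours:

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def predicted_rates(d_easy, d_hard, c_easy, c_hard, lag, block_len, n_blocks):
    """Per-trial (hit rate, false alarm rate) predictions.

    Blocks alternate easy, hard, easy, ...  Sensitivity tracks the
    environment immediately (only the noise distribution moves; the
    signal mean is fixed at d_easy, with the easy noise mean at 0).
    The criterion switches only `lag` trials after each block break."""
    signal_mean = d_easy
    rates = []
    for b in range(n_blocks):
        env = 'easy' if b % 2 == 0 else 'hard'
        prev = 'hard' if env == 'easy' else 'easy'
        d = d_easy if env == 'easy' else d_hard
        for t in range(block_len):
            # the criterion still reflects the previous environment
            # for the first `lag` trials of every block after the first
            crit_env = prev if (b > 0 and t < lag) else env
            c = c_easy if crit_env == 'easy' else c_hard
            hr = phi(signal_mean - c)           # P(respond signal | signal)
            far = phi((signal_mean - d) - c)    # P(respond signal | noise)
            rates.append((hr, far))
    return rates
```

With hypothetical values such as d'_E=3, d'_H=1, C_E=1.5 and C_H=2.2, this reproduces the qualitative pattern of Figure 6: at an easy-to-hard switch the hit rate is momentarily unchanged while the false alarm rate jumps, and both fall once the criterion updates after L trials.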
This assumption of a stepwise change in criterion may seem unreasonable. We examined other assumptions, such as a smooth exponential approach from the old to the new criterion, or a piecewise-linear approach, and found that they provided no significant improvement in fit. We chose the stepwise criterion change for its simplicity, and for the interpretability of its parameter L (which simply measures the number of decisions after an environment change before the participant changes their criterion).

Model Predictions

Figure 6 makes the predictions of this model clearer. The SDT model illustrated in the top-left corner, t1, illustrates behavior during easy decisions: note that d'_E is relatively large (the signal and noise distributions are relatively far apart) and that the criterion C_E approximates the optimal criterion. This model leads to the hit rate (HR) and false alarm rate (FAR) predictions at the left of the right-hand panel, with high HR and low FAR. At time t2, as shown in the lower right-hand panel, the decision environment changes from easy to hard. We assume that this change was implemented by increasing the similarity of the distractor stimuli to the signal stimuli, while keeping signal stimulus properties constant (so noise distributions change, but signal distributions don't). The SDT model then operating is shown in the lower left corner, with label t2. Note that sensitivity has decreased due to the harder stimuli (d'_E has changed to d'_H), but the criterion has not yet changed. This leads to the HR and FAR predictions shown under t2 in the lower right-hand panel in dashed lines: no immediate change in HR, but a large increase in FAR. After some lag, L, the

decision-maker updates their criterion to C_H, the appropriate criterion for hard decision environments (shown by the dashed line in the upper left plot). The SDT model then operating is shown as t3 on the left-hand side of Figure 6, and its predictions are shown by the dashed lines under the label t3 in the lower right-hand plot: a decrease in both HR and FAR. Finally, the decision environment changes again, back to the easy condition, changing sensitivity but not immediately changing the decision criterion. This corresponds to SDT model t4 and a predicted decrease in FAR, with no change in HR. Again, after some lag, the decision criterion is changed to C_E, bringing us back to the SDT model t1.

As described, the dashed lines in Figure 6 show predictions for an individual subject's HR and FAR: note that our assumption of a stepwise criterion change implies only stepwise changes in predictions. In the text, we show fits to large groups of participants, where each participant is fit individually, and the observed and expected HR and FAR are averaged over subjects for the purposes of graphing only. Those graphs show smooth changes in FAR and HR, as illustrated by the solid lines in the right-hand panels of Figure 6. Such smooth transitions are merely the result of averaging over many individual stepwise transitions, where the position of the step is variable.

Estimating Model Parameters

We estimate the parameters of our model using maximum likelihood. Unlike static SDT models, for which closed-form analytic solutions are available, maximum likelihood parameter estimates for the dynamic model must be obtained by numerical search. Conditional on a given value of the lag parameter, L, the model specifies one of four SDT models (t1, t2, t3, or t4 above) as operating on every decision.
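Conditional on the lag, then, every trial is governed by one of these four sub-models, which makes the likelihood search easy to sketch. The following is an illustration rather than the authors' implementation: it assumes alternating equal-length easy/hard blocks beginning with an easy block, the standard equal-variance SDT likelihood (noise mean at zero), and SciPy's Nelder-Mead simplex for the continuous parameters at each candidate integer lag; all names are ours:

```python
from math import erf, log, sqrt
import numpy as np
from scipy.optimize import minimize

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def regime(t, block_len, lag):
    """Which of the four SDT sub-models (t1..t4) governs trial t,
    for blocks alternating easy, hard, easy, ... of length block_len."""
    b, i = divmod(t, block_len)
    env_easy = (b % 2 == 0)
    crit_easy = env_easy if (b == 0 or i >= lag) else (not env_easy)
    if env_easy and crit_easy:
        return 1                    # t1: easy stimuli, easy criterion
    if not env_easy and crit_easy:
        return 2                    # t2: hard stimuli, easy criterion
    if not env_easy:
        return 3                    # t3: hard stimuli, hard criterion
    return 4                        # t4: easy stimuli, hard criterion

def neg_loglik(params, stim, resp, block_len, lag):
    """stim[t] = 1 on signal trials; resp[t] = 1 for 'signal' responses."""
    d_e, d_h, c_e, c_h = params
    nll = 0.0
    for t, (s, r) in enumerate(zip(stim, resp)):
        reg = regime(t, block_len, lag)
        d = d_e if reg in (1, 4) else d_h
        c = c_e if reg in (1, 2) else c_h
        p = phi(d - c) if s else phi(-c)      # P(respond 'signal')
        p = min(max(p, 1e-9), 1.0 - 1e-9)     # guard the logarithm
        nll -= log(p if r else 1.0 - p)
    return nll

def fit_dynamic_sdt(stim, resp, block_len):
    """Profile over integer lags (0 < L < block length); keep the best fit."""
    best = None
    for lag in range(1, block_len):
        res = minimize(neg_loglik, x0=np.array([1.5, 1.0, 0.5, 1.0]),
                       args=(stim, resp, block_len, lag),
                       method='Nelder-Mead')
        if best is None or res.fun < best[2]:
            best = (lag, res.x, res.fun)
    return best   # (lag, [d'_E, d'_H, C_E, C_H], negative log-likelihood)
```

Profiling over the integer lag keeps each inner optimization smooth, at the cost of one continuous fit per candidate lag; those fits are independent, which also makes the search straightforward to parallelize.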
These models naturally specify the probability of making each kind of response (hits, misses, false alarms and correct rejections), and so a likelihood for the parameter values under consideration can be obtained by multiplying probabilities across all trials in an experimental sequence. A practical difficulty is introduced because the lag parameter can take on only integer values, while the other parameters are continuous. We solved this problem by searching for maximum likelihood fits separately for every possible value of the lag parameter, and choosing the global maximum out of these (the lag parameter is constrained to be greater than zero, and less than the length of the longest decision environment).

Unequal Variance Dynamic SDT

A bonus from our use of a dynamic model is that we can also estimate an unequal variance version of our SDT model. This model has seven rather than five parameters, with two new parameters describing the ratio of variances in the signal and noise distributions, for hard and easy decision environments. A static SDT model for hard and easy decision environments (i.e., L=0, an immediate criterion change) is unable to estimate unequal variance models, as there are effectively no data from models t2 and t4. We do not discuss the unequal variance model in the text, for two reasons: to maintain our focus on model simplicity, and because we have found that the unequal variance model provides no significant improvement in fit over the equal variance model.

4.4 Applied Ideal Observer Model

The ideal observer model developed in Section 1.4 can be applied to data from our preliminary experiments. Naturally, there are degrees of idealization for such a model. We have chosen to idealize only the change detection aspect of these data, leaving levels of imperfection in the decision making aspect equal to those displayed by our participants.
The alternative, idealizing both aspects, leads to a less interesting class of models, with instant change detection and perfect decision making. Our choice represents an attempt to answer the question: given the observed imperfection in decision making, how quickly could an ideal observer reliably detect switch points? Other ways of implementing an ideal observer model for our paradigm are certainly possible, and we will continue developing and testing different ideal observer models as our research progresses.

We implemented the ideal observer model individually for each participant, as follows. Firstly, for each participant, measures of decision sensitivity and bias in both easy and hard decision environments (d'_E, d'_H, C_E and C_H) were obtained from our dynamic SDT measurement model. These d' values were then used to simulate data sequences. With these simulated sequences, and a particular level of evidence required for change detection (the Type I error rate), the change points detected by an ideal observer model can be calculated, along with the hindsight change points,

as discussed above. For each participant, we identified the ideal required level of evidence as the one that minimized the mean squared difference between the ideal observer switch points and the hindsight switch points. Note that the ideal level of evidence changes across individuals, as decision behavior (sensitivity and bias) changes.

An example of this process is shown in Figure 7. For any data sequence (simulated using an individual participant's d' and C values) we calculated the best hindsight estimate of the switch point, and the ideal observer's online switch points corresponding to 150 different evidence levels, logarithmically spaced between p=.63 and p=. We repeated this exercise with 10,000 different random data sequences. This resulted in a distribution of observed switch points for each of the 150 confidence levels, and for the best hindsight switch points. Example distributions are shown in the top panel of Figure 7. The hindsight switch values are symmetrically distributed close to the true switch point (50). An example of an early switching distribution is shown, corresponding to a low level of evidence (p=.1); note that this level of evidence leads to some premature switch detection. An example of a late switching distribution is also shown (p=.001). The ideal switching distribution is the one that most closely approximates the hindsight switch points, in a least-squares sense; for this example, the ideal level of evidence was p=.010. The lower panel of Figure 7 shows the RMS difference between the observed switch point distributions and the hindsight values, as a function of the required level of evidence. The ideal level of evidence leads to an RMS difference between observed and hindsight switch points of only 15.3 trials.

Figure 7. Example of applied ideal observer.

We compared our participants' behavior to that of the ideal observer model by comparing the estimated switch detection lag for each observer (from the dynamic SDT model) with the mean of the switch detection lags from the corresponding ideal observer. Histograms across participants of these differences for Experiments #1 to #4 are shown in Figure 8. Larger values on the x-axes indicate greater sub-optimality from our participants (i.e., a greater difference between observed switch detection lags and mean ideal observer detection lags). Our participants were slower to detect the change in decision environments than the ideal observer models by an average of 17.5 trials in Experiment #1. In that experiment, our participants took an average of 21.8 trials to detect the change, whereas the corresponding ideal observer models took only 4.4 trials on average.

Experiments #2 and #3 both employed lexical decision with stimulus switch points only occurring between blocks. In both these experiments, the mean difference between participants' estimated switch lags and those of the ideal observer models was close to zero (less than one trial difference in each experiment). The correlation between ideal observer lags and the estimated lags for our participants was also high: r=.555 in Experiment #2 and r=.502 in Experiment #3. It seems that, when stimulus changes occur during block breaks, our participants were able to detect those changes as quickly as an ideal observer. A possible explanation that we are currently investigating is that participants were sub-optimal during the early parts of the experiments, but then caught on to the design of the stimulus sequences and were able to adjust more quickly in the late parts of the experiments.
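In simplified form, the three ingredients used here (a hindsight switch-point estimate, an online detector run at a given evidence level, and selection of the evidence level whose online switch points best match hindsight in a least-squares sense) can be sketched as follows. This is not our full ideal observer: it reduces each trial to a Bernoulli "error" indicator with known rates before and after the switch, uses a CUSUM-style likelihood-ratio detector whose threshold of -log(alpha) is a rough stand-in for a Type I error rate, and runs far fewer simulated sequences; names and numbers are illustrative:

```python
from math import log
import numpy as np

def hindsight_switch(x, p1, p2):
    """Retrospective ML switch point: the k maximizing the likelihood
    that x[:k] ~ Bernoulli(p1) and x[k:] ~ Bernoulli(p2)."""
    x = np.asarray(x)
    def ll(xs, p):
        return xs.sum() * log(p) + (len(xs) - xs.sum()) * log(1.0 - p)
    return max(range(1, len(x)), key=lambda k: ll(x[:k], p1) + ll(x[k:], p2))

def online_switch(x, p1, p2, alpha):
    """First trial at which a CUSUM of the log likelihood ratio for
    'switched' vs. 'not switched' exceeds -log(alpha)."""
    threshold, lr = -log(alpha), 0.0
    for t, v in enumerate(x):
        lr += log(p2 / p1) if v else log((1.0 - p2) / (1.0 - p1))
        lr = max(lr, 0.0)            # reset: no evidence of a switch yet
        if lr > threshold:
            return t
    return len(x)                     # switch never detected

def ideal_evidence_level(p1, p2, true_k, n, alphas, n_sims=200, seed=1):
    """The alpha whose online switch points best match hindsight,
    in the least-squares sense used in the text."""
    rng = np.random.default_rng(seed)
    sq_err = {a: 0.0 for a in alphas}
    for _ in range(n_sims):
        x = np.concatenate([(rng.random(true_k) < p1),
                            (rng.random(n - true_k) < p2)]).astype(int)
        h = hindsight_switch(x, p1, p2)
        for a in alphas:
            sq_err[a] += (online_switch(x, p1, p2, a) - h) ** 2
    return min(alphas, key=lambda a: sq_err[a])
```

Too liberal an alpha produces premature detections; too conservative an alpha produces late ones; the ideal level sits between, exactly as in the lower panel of Figure 7.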
It is also interesting that the histograms for Experiments #2 and #3 are very similar, indicating

that informing participants about the nature of the task (in Experiment #3) had little or no effect.

Figure 8. Optimality of change detection.

Experiment #4 used a numerosity categorization task. In this task, participants' estimated switch detection lags were actually smaller on average than the corresponding ideal observer lags (7.7 trials vs. 12.5 trials, respectively). This result has several possible interpretations, which require further modeling and analysis to tease apart. One explanation is that participants in Experiment #4 were simply using a more liberal level of evidence for change detection than was ideal. This would lead to the very fast change detections shown in Figure 8, at the expense of some premature change detections and corresponding performance penalties. Another explanation is that participants in this task were using more information in their change detection processing than we allowed the ideal observer model. Our ideal observer model operated on constraints provided by the sensitivity and bias of the participants in the decision task. This task used a signal-to-respond procedure, and it is possible that participants continued to process stimulus information after giving their responses. This would provide them with more information on which to base their change detection processing than just the information used to make decision task responses.

5 Proposed New Directions

We propose several new research directions, based on the dynamic decision making paradigm and the dynamic SDT and ideal observer models developed above. These new directions include many new experiments, several of which are detailed below. Some of these experiments already have well developed methodological and stimulus-related details. Other experiments are less mature, and precise details of stimulus properties and the like are not finalized.
These experiments are in areas where we feel that input from AFOSR researchers may be valuable in helping us to tailor the research towards Air Force applications. For example, one of the threat assessment experiments below is described as using quite unrealistic, cartoon-like stimuli. We have chosen those stimuli because they allow tight experimental control, and parametric manipulation of factors such as similarity. However, in collaboration with AFOSR researchers, we could use precisely the same experimental design with more realistic stimuli (e.g., real pictures, or satellite photos, rather than cartoons). We also plan to collaborate with AFOSR researchers on experimental aspects other than stimulus classes. For example, below we describe three experiments that assess the effects of fatigue, distraction and instruction on decision making. Input from AFOSR researchers may result in shifting the emphasis of these experiments to other related variables (e.g., the effects of conflicting or inappropriate orders).

We also propose developments beyond new experiments. We will continue to develop and refine the dynamic SDT model proposed above. One direction of development is to produce a hybrid model that incorporates elements of the dynamic SDT model and the ideal observer model. This model would provide an actual generative mechanism for the detection of change in decision environments, rather than the solely descriptive account available at present. We also propose the construction of a deliverable package from such a model. We envisage a software package that administers an experimental test to an individual (around 45 minutes) and then analyzes their data using our model. These analyses would return easily interpreted quantities, such as the lag parameter estimate, which describes how quickly that individual adjusts to dynamic environments, and the level of evidence required by a participant in order to detect change.
Some of the experiments we propose would improve the managerial value of this system by providing normative data, and possibly correlations with existing psychometric measures.

The feasibility of such a system is currently limited by the numerical difficulties inherent in estimating the parameters of the dynamic SDT model. It is possible for analyses of an individual data set (around 1,000 observations) to take several hours on a reasonably fast workstation. We plan to address this problem by developing estimation algorithms suitable for use on large-scale multiprocessor computers, such as those at AFOSR. We have already begun this task, using the OpenMP (v2) standard for multi-threading compiler directives. Our goal is to develop software that analyzes data in real time: quickly enough that managers and team leaders can make use of the information.

5.2 Varying Instructions & Payoffs

Changing decision difficulty is not the only way that dynamic decision environments can be instantiated. For example, one experiment we propose below involves dynamically changing the payoff matrix for decisions. To make this concrete, imagine that incorrectly identifying a friend as an enemy (a false alarm) carries much greater penalties than incorrectly identifying an enemy as a friend (a miss). If these payoffs are dynamically altered, ideal decision makers will adjust their decision criteria appropriately, in order to minimize overall penalties. Changes in decision instructions can be handled naturally within our ideal observer model. Recall that the ideal level of evidence required to detect a change in that model (the Type I error rate) was calculated by minimizing the discrepancy between observed change points and those calculated with the benefit of hindsight. We defined discrepancy using a least squares error function. This function is symmetric: the same penalty is given to early, premature detection as to late detection.
If the payoff matrix in a particular decision making environment were modified to be asymmetric (e.g., by penalizing premature change detection more heavily than late detection), the discrepancy function used to calculate the ideal Type I error rate could be changed accordingly.

5.3 Experiment #5: Simple Change Detection

This experiment is designed for simple compatibility with the ideal observer model described above, allowing us to assess exactly how sub-optimal participants' behavior is. To simplify the task as far as possible, we remove altogether the need for observers to make decisions on each trial. Instead, we provide observers with a sequence of data, and ask them to identify the switch point at which the properties of the data sequence change. To improve compatibility with our other experiments, and with our ideal observer model, participants are told that there are exactly two different states for a sequence, and that each sequence will begin in one state and then switch to the other at exactly one point during the sequence. They will be given elements of the sequence one at a time, and asked to identify the change point as accurately as they can. Many different types of sequence could be employed. Participants could be shown the daily temperatures recorded in a (fictional) city, and told that, on one day during the month, a cold change lasting the rest of the month will arrive. With sufficient variability in daily temperature, this change will be difficult to detect. Alternatively, participants could be shown a sequence of outcomes from rolling a pair of suspect dice, and asked to identify when the dice become unfair. Another possibility is to show participants very noisy simulated radar pictures, with a very weak stimulus appearing at one point during the sequence. Participants are to identify that point as accurately as possible.
An important aspect of these experiments is that participants are required to identify the point of stimulus change online, while the stimuli are still being presented. They are also allowed to make only one response: once they have indicated that the change has occurred, they are unable to change their response. This situation provides compatibility with the ideal observer model, and with the dynamic SDT model. This constraint also provides a rationale for our definition of the ideal observer model as one that minimizes hindsight error. If an observer makes a too-hasty decision, they will discover this fact, and have time to regret it, while observing the remainder of the sequence pass by.

5.4 Experiment #6a: Memory for Landscape Pictures

Experiments #6a and #6b aim to generalize the results of our experiments to more realistic stimuli (pictures of faces and outdoor scenery) and to higher-level decisions (recognition memory). Recognition memory research has identified many mirror effects, similar in structure to those evident in the lexical decision data from our experiments. We will use a standard study-test recognition

memory design, using either pictures of human faces or fragments of landscape pictures as stimuli. As in Experiments #1-#3, we will alter decision environments by alternating easy and difficult decision contexts. In Experiment #6a we will use landscape pictures, and will construct easy and difficult distractor pictures by taking fragments of larger pictures and manipulating the amount of overlap between them. An example is given in Figure 9.

Figure 9. Illustration of recognition memory stimuli (Experiment #6a). Easier stimuli (C and D) do not overlap; harder stimuli (A and B) do, and so are more confusable.

The top half of that figure shows a landscape photograph with two fragments (A and B) cut out. These fragments share a great deal of overlap and so are very similar. The bottom row of Figure 9 shows another landscape picture with two fragments (C and D) cut out, but with less overlap and so lower similarity. If fragments A and C were in a decision maker's study list, and their task is to decide whether newly presented fragments are new or old (i.e., in the to-be-remembered study list), we could create a very difficult decision making environment by presenting them with fragment B, which is very similar to the memorized fragment A. Conversely, we could ask them to decide whether fragment D is new or old, which would be much easier, as it's less similar to items in memory.

5.5 Experiment #6b: Memory for Human Faces

We could perform the same recognition memory experiment with human face pictures as stimuli, but the similarity manipulation would be implemented differently. We have access to a large database of black-and-white face pictures, from the FERET database. We have begun to pair these faces by subjective similarity rating. That is, we defined several major categories of face (by gender, age and ethnicity). Then, within each category, we identified pairs of very similar faces. For example, the top half of Figure 10 shows a pair of very similar faces from within one category (younger Asian women). These two faces are very similar to each other and so are highly confusable, like the A-B picture fragments of Experiment #6a. The bottom row of Figure 10 shows a random pairing from the same category: these two pictures are less similar, and so less confusable, like the C-D landscape fragments of Experiment #6a.

Figure 10: Example faces for Experiment #6b. Top two faces are very similar, bottom two are less similar.

5.6 Experiment #7: Threat Assessment

Experiment #7 continues the progression established in Experiments #6a and #6b towards more realistic decision making environments, including the assessment of threat level depicted in semi-realistic graphic displays. Those images could be of real people, or satellite reconnaissance photos, for instance. The decision making task used in this and all further experiments is to decide whether a stimulus presented in a noisy image represents a threat or not, and then respond accordingly. This task has several advantages over the tasks used above. Firstly, immediacy: the (imaginary) threat will be directed at the decision maker, which provides a


More information

The Clock Ticking Changes Our Performance

The Clock Ticking Changes Our Performance The Clock Ticking Changes Our Performance Shoko Yamane, Naohiro Matsumura Faculty of Economics, Kinki University; Graduate School of Economics, Osaka University syamane@kindai.ac.jp Abstract We examined

More information

How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis?

How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis? How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis? Richards J. Heuer, Jr. Version 1.2, October 16, 2005 This document is from a collection of works by Richards J. Heuer, Jr.

More information

Running head: INDIVIDUAL DIFFERENCES 1. Why to treat subjects as fixed effects. James S. Adelman. University of Warwick.

Running head: INDIVIDUAL DIFFERENCES 1. Why to treat subjects as fixed effects. James S. Adelman. University of Warwick. Running head: INDIVIDUAL DIFFERENCES 1 Why to treat subjects as fixed effects James S. Adelman University of Warwick Zachary Estes Bocconi University Corresponding Author: James S. Adelman Department of

More information

Classical Psychophysical Methods (cont.)

Classical Psychophysical Methods (cont.) Classical Psychophysical Methods (cont.) 1 Outline Method of Adjustment Method of Limits Method of Constant Stimuli Probit Analysis 2 Method of Constant Stimuli A set of equally spaced levels of the stimulus

More information

Study on perceptually-based fitting line-segments

Study on perceptually-based fitting line-segments Regeo. Geometric Reconstruction Group www.regeo.uji.es Technical Reports. Ref. 08/2014 Study on perceptually-based fitting line-segments Raquel Plumed, Pedro Company, Peter A.C. Varley Department of Mechanical

More information

Name Psychophysical Methods Laboratory

Name Psychophysical Methods Laboratory Name Psychophysical Methods Laboratory 1. Classical Methods of Psychophysics These exercises make use of a HyperCard stack developed by Hiroshi Ono. Open the PS 325 folder and then the Precision and Accuracy

More information

Lec 02: Estimation & Hypothesis Testing in Animal Ecology

Lec 02: Estimation & Hypothesis Testing in Animal Ecology Lec 02: Estimation & Hypothesis Testing in Animal Ecology Parameter Estimation from Samples Samples We typically observe systems incompletely, i.e., we sample according to a designed protocol. We then

More information

Virtual Reality Testing of Multi-Modal Integration in Schizophrenic Patients

Virtual Reality Testing of Multi-Modal Integration in Schizophrenic Patients Virtual Reality Testing of Multi-Modal Integration in Schizophrenic Patients Anna SORKIN¹, Avi PELED 2, Daphna WEINSHALL¹ 1 Interdisciplinary Center for Neural Computation, Hebrew University of Jerusalem,

More information

EXERCISE: HOW TO DO POWER CALCULATIONS IN OPTIMAL DESIGN SOFTWARE

EXERCISE: HOW TO DO POWER CALCULATIONS IN OPTIMAL DESIGN SOFTWARE ...... EXERCISE: HOW TO DO POWER CALCULATIONS IN OPTIMAL DESIGN SOFTWARE TABLE OF CONTENTS 73TKey Vocabulary37T... 1 73TIntroduction37T... 73TUsing the Optimal Design Software37T... 73TEstimating Sample

More information

Appendix B Statistical Methods

Appendix B Statistical Methods Appendix B Statistical Methods Figure B. Graphing data. (a) The raw data are tallied into a frequency distribution. (b) The same data are portrayed in a bar graph called a histogram. (c) A frequency polygon

More information

Evolutionary Programming

Evolutionary Programming Evolutionary Programming Searching Problem Spaces William Power April 24, 2016 1 Evolutionary Programming Can we solve problems by mi:micing the evolutionary process? Evolutionary programming is a methodology

More information

Supplementary Materials

Supplementary Materials Supplementary Materials 1. Material and Methods: Our stimuli set comprised 24 exemplars for each of the five visual categories presented in the study: faces, houses, tools, strings and false-fonts. Examples

More information

The Regression-Discontinuity Design

The Regression-Discontinuity Design Page 1 of 10 Home» Design» Quasi-Experimental Design» The Regression-Discontinuity Design The regression-discontinuity design. What a terrible name! In everyday language both parts of the term have connotations

More information

INTERVIEWS II: THEORIES AND TECHNIQUES 5. CLINICAL APPROACH TO INTERVIEWING PART 1

INTERVIEWS II: THEORIES AND TECHNIQUES 5. CLINICAL APPROACH TO INTERVIEWING PART 1 INTERVIEWS II: THEORIES AND TECHNIQUES 5. CLINICAL APPROACH TO INTERVIEWING PART 1 5.1 Clinical Interviews: Background Information The clinical interview is a technique pioneered by Jean Piaget, in 1975,

More information

Recent developments for combining evidence within evidence streams: bias-adjusted meta-analysis

Recent developments for combining evidence within evidence streams: bias-adjusted meta-analysis EFSA/EBTC Colloquium, 25 October 2017 Recent developments for combining evidence within evidence streams: bias-adjusted meta-analysis Julian Higgins University of Bristol 1 Introduction to concepts Standard

More information

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES Sawtooth Software RESEARCH PAPER SERIES The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? Dick Wittink, Yale University Joel Huber, Duke University Peter Zandan,

More information

Categorical Perception

Categorical Perception Categorical Perception Discrimination for some speech contrasts is poor within phonetic categories and good between categories. Unusual, not found for most perceptual contrasts. Influenced by task, expectations,

More information

The Clock Ticking Changes Our Performance

The Clock Ticking Changes Our Performance Shikakeology: Designing Triggers for Behavior Change: Papers from the 2013 AAAI Spring Symposium The Clock Ticking Changes Our Performance Shoko Yamane, Naohiro Matsumura Faculty of Economics, Kinki University;

More information

20. Experiments. November 7,

20. Experiments. November 7, 20. Experiments November 7, 2015 1 Experiments are motivated by our desire to know causation combined with the fact that we typically only have correlations. The cause of a correlation may be the two variables

More information

Supplementary Materials: Materials and Methods Figures S1-S2 Tables S1-S17 References

Supplementary Materials: Materials and Methods Figures S1-S2 Tables S1-S17 References Supplementary Materials: Materials and Methods Figures S1-S2 Tables S1-S17 References Materials and Methods Simon Task Participants were randomly assigned to one of four versions of the task. Upon return

More information

Cultural Differences in Cognitive Processing Style: Evidence from Eye Movements During Scene Processing

Cultural Differences in Cognitive Processing Style: Evidence from Eye Movements During Scene Processing Cultural Differences in Cognitive Processing Style: Evidence from Eye Movements During Scene Processing Zihui Lu (zihui.lu@utoronto.ca) Meredyth Daneman (daneman@psych.utoronto.ca) Eyal M. Reingold (reingold@psych.utoronto.ca)

More information

Detection Theory: Sensory and Decision Processes

Detection Theory: Sensory and Decision Processes Detection Theory: Sensory and Decision Processes Lewis O. Harvey, Jr. Department of Psychology and Neuroscience University of Colorado Boulder The Brain (observed) Stimuli (observed) Responses (observed)

More information

EXPERIMENTAL RESEARCH DESIGNS

EXPERIMENTAL RESEARCH DESIGNS ARTHUR PSYC 204 (EXPERIMENTAL PSYCHOLOGY) 14A LECTURE NOTES [02/28/14] EXPERIMENTAL RESEARCH DESIGNS PAGE 1 Topic #5 EXPERIMENTAL RESEARCH DESIGNS As a strict technical definition, an experiment is a study

More information

Absolute Identification is Surprisingly Faster with More Closely Spaced Stimuli

Absolute Identification is Surprisingly Faster with More Closely Spaced Stimuli Absolute Identification is Surprisingly Faster with More Closely Spaced Stimuli James S. Adelman (J.S.Adelman@warwick.ac.uk) Neil Stewart (Neil.Stewart@warwick.ac.uk) Department of Psychology, University

More information

Learning Deterministic Causal Networks from Observational Data

Learning Deterministic Causal Networks from Observational Data Carnegie Mellon University Research Showcase @ CMU Department of Psychology Dietrich College of Humanities and Social Sciences 8-22 Learning Deterministic Causal Networks from Observational Data Ben Deverett

More information

UNIT 4 ALGEBRA II TEMPLATE CREATED BY REGION 1 ESA UNIT 4

UNIT 4 ALGEBRA II TEMPLATE CREATED BY REGION 1 ESA UNIT 4 UNIT 4 ALGEBRA II TEMPLATE CREATED BY REGION 1 ESA UNIT 4 Algebra II Unit 4 Overview: Inferences and Conclusions from Data In this unit, students see how the visual displays and summary statistics they

More information

Project exam in Cognitive Psychology PSY1002. Autumn Course responsible: Kjellrun Englund

Project exam in Cognitive Psychology PSY1002. Autumn Course responsible: Kjellrun Englund Project exam in Cognitive Psychology PSY1002 Autumn 2007 674107 Course responsible: Kjellrun Englund Stroop Effect Dual processing causing selective attention. 674107 November 26, 2007 Abstract This document

More information

Chapter 5: Field experimental designs in agriculture

Chapter 5: Field experimental designs in agriculture Chapter 5: Field experimental designs in agriculture Jose Crossa Biometrics and Statistics Unit Crop Research Informatics Lab (CRIL) CIMMYT. Int. Apdo. Postal 6-641, 06600 Mexico, DF, Mexico Introduction

More information

The Hare and the Tortoise: Emphasizing speed can change the evidence used to make decisions.

The Hare and the Tortoise: Emphasizing speed can change the evidence used to make decisions. Emphasising Decision Speed 1 The Hare and the Tortoise: Emphasizing speed can change the evidence used to make decisions. Babette Rae 1, Andrew Heathcote 1, Chris Donkin 2, Lee Averell 1 and Scott Brown

More information

Supporting Information

Supporting Information 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 Supporting Information Variances and biases of absolute distributions were larger in the 2-line

More information

Interpreting Instructional Cues in Task Switching Procedures: The Role of Mediator Retrieval

Interpreting Instructional Cues in Task Switching Procedures: The Role of Mediator Retrieval Journal of Experimental Psychology: Learning, Memory, and Cognition 2006, Vol. 32, No. 3, 347 363 Copyright 2006 by the American Psychological Association 0278-7393/06/$12.00 DOI: 10.1037/0278-7393.32.3.347

More information

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES 24 MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES In the previous chapter, simple linear regression was used when you have one independent variable and one dependent variable. This chapter

More information

Technical Specifications

Technical Specifications Technical Specifications In order to provide summary information across a set of exercises, all tests must employ some form of scoring models. The most familiar of these scoring models is the one typically

More information

Chapter 23. Inference About Means. Copyright 2010 Pearson Education, Inc.

Chapter 23. Inference About Means. Copyright 2010 Pearson Education, Inc. Chapter 23 Inference About Means Copyright 2010 Pearson Education, Inc. Getting Started Now that we know how to create confidence intervals and test hypotheses about proportions, it d be nice to be able

More information

Supplementary Materials

Supplementary Materials Supplementary Materials Supplementary Figure S1: Data of all 106 subjects in Experiment 1, with each rectangle corresponding to one subject. Data from each of the two identical sub-sessions are shown separately.

More information

Automatic detection, consistent mapping, and training * Originally appeared in

Automatic detection, consistent mapping, and training * Originally appeared in Automatic detection - 1 Automatic detection, consistent mapping, and training * Originally appeared in Bulletin of the Psychonomic Society, 1986, 24 (6), 431-434 SIU L. CHOW The University of Wollongong,

More information

RECALL OF PAIRED-ASSOCIATES AS A FUNCTION OF OVERT AND COVERT REHEARSAL PROCEDURES TECHNICAL REPORT NO. 114 PSYCHOLOGY SERIES

RECALL OF PAIRED-ASSOCIATES AS A FUNCTION OF OVERT AND COVERT REHEARSAL PROCEDURES TECHNICAL REPORT NO. 114 PSYCHOLOGY SERIES RECALL OF PAIRED-ASSOCIATES AS A FUNCTION OF OVERT AND COVERT REHEARSAL PROCEDURES by John W. Brelsford, Jr. and Richard C. Atkinson TECHNICAL REPORT NO. 114 July 21, 1967 PSYCHOLOGY SERIES!, Reproduction

More information

Writing Reaction Papers Using the QuALMRI Framework

Writing Reaction Papers Using the QuALMRI Framework Writing Reaction Papers Using the QuALMRI Framework Modified from Organizing Scientific Thinking Using the QuALMRI Framework Written by Kevin Ochsner and modified by others. Based on a scheme devised by

More information

Bayesian and Frequentist Approaches

Bayesian and Frequentist Approaches Bayesian and Frequentist Approaches G. Jogesh Babu Penn State University http://sites.stat.psu.edu/ babu http://astrostatistics.psu.edu All models are wrong But some are useful George E. P. Box (son-in-law

More information

Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination

Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination Timothy N. Rubin (trubin@uci.edu) Michael D. Lee (mdlee@uci.edu) Charles F. Chubb (cchubb@uci.edu) Department of Cognitive

More information

Model evaluation using grouped or individual data

Model evaluation using grouped or individual data Psychonomic Bulletin & Review 8, (4), 692-712 doi:.378/pbr..692 Model evaluation using grouped or individual data ANDREW L. COHEN University of Massachusetts, Amherst, Massachusetts AND ADAM N. SANBORN

More information

Changing expectations about speed alters perceived motion direction

Changing expectations about speed alters perceived motion direction Current Biology, in press Supplemental Information: Changing expectations about speed alters perceived motion direction Grigorios Sotiropoulos, Aaron R. Seitz, and Peggy Seriès Supplemental Data Detailed

More information

2012 Course : The Statistician Brain: the Bayesian Revolution in Cognitive Science

2012 Course : The Statistician Brain: the Bayesian Revolution in Cognitive Science 2012 Course : The Statistician Brain: the Bayesian Revolution in Cognitive Science Stanislas Dehaene Chair in Experimental Cognitive Psychology Lecture No. 4 Constraints combination and selection of a

More information

Why do Psychologists Perform Research?

Why do Psychologists Perform Research? PSY 102 1 PSY 102 Understanding and Thinking Critically About Psychological Research Thinking critically about research means knowing the right questions to ask to assess the validity or accuracy of a

More information

Differentiation and Response Bias in Episodic Memory: Evidence From Reaction Time Distributions

Differentiation and Response Bias in Episodic Memory: Evidence From Reaction Time Distributions Journal of Experimental Psychology: Learning, Memory, and Cognition 2010, Vol. 36, No. 2, 484 499 2010 American Psychological Association 0278-7393/10/$12.00 DOI: 10.1037/a0018435 Differentiation and Response

More information

On optimal decision-making in brains and social insect colonies Marshall, Bogacz, Dornhaus, Planque, Kovacs,Franks

On optimal decision-making in brains and social insect colonies Marshall, Bogacz, Dornhaus, Planque, Kovacs,Franks On optimal decision-making in brains and social insect colonies Marshall, Bogacz, Dornhaus, Planque, Kovacs,Franks Presented by Reffat Sharmeen, Kolli Sai Namratha Contents Introduction Optimal decision-making

More information

Evaluation Models STUDIES OF DIAGNOSTIC EFFICIENCY

Evaluation Models STUDIES OF DIAGNOSTIC EFFICIENCY 2. Evaluation Model 2 Evaluation Models To understand the strengths and weaknesses of evaluation, one must keep in mind its fundamental purpose: to inform those who make decisions. The inferences drawn

More information

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo Please note the page numbers listed for the Lind book may vary by a page or two depending on which version of the textbook you have. Readings: Lind 1 11 (with emphasis on chapters 10, 11) Please note chapter

More information

Pooling Subjective Confidence Intervals

Pooling Subjective Confidence Intervals Spring, 1999 1 Administrative Things Pooling Subjective Confidence Intervals Assignment 7 due Friday You should consider only two indices, the S&P and the Nikkei. Sorry for causing the confusion. Reading

More information

Lessons in biostatistics

Lessons in biostatistics Lessons in biostatistics The test of independence Mary L. McHugh Department of Nursing, School of Health and Human Services, National University, Aero Court, San Diego, California, USA Corresponding author:

More information

Congruency Effects with Dynamic Auditory Stimuli: Design Implications

Congruency Effects with Dynamic Auditory Stimuli: Design Implications Congruency Effects with Dynamic Auditory Stimuli: Design Implications Bruce N. Walker and Addie Ehrenstein Psychology Department Rice University 6100 Main Street Houston, TX 77005-1892 USA +1 (713) 527-8101

More information

OCW Epidemiology and Biostatistics, 2010 David Tybor, MS, MPH and Kenneth Chui, PhD Tufts University School of Medicine October 27, 2010

OCW Epidemiology and Biostatistics, 2010 David Tybor, MS, MPH and Kenneth Chui, PhD Tufts University School of Medicine October 27, 2010 OCW Epidemiology and Biostatistics, 2010 David Tybor, MS, MPH and Kenneth Chui, PhD Tufts University School of Medicine October 27, 2010 SAMPLING AND CONFIDENCE INTERVALS Learning objectives for this session:

More information

Attention and Concentration Problems Following Traumatic Brain Injury. Patient Information Booklet. Talis Consulting Limited

Attention and Concentration Problems Following Traumatic Brain Injury. Patient Information Booklet. Talis Consulting Limited Attention and Concentration Problems Following Traumatic Brain Injury Patient Information Booklet Talis Consulting Limited What are Attention and Concentration? Attention and concentration are two skills

More information

Sensory Cue Integration

Sensory Cue Integration Sensory Cue Integration Summary by Byoung-Hee Kim Computer Science and Engineering (CSE) http://bi.snu.ac.kr/ Presentation Guideline Quiz on the gist of the chapter (5 min) Presenters: prepare one main

More information

Assessing Modes of Interaction

Assessing Modes of Interaction Project 2 Assessing Modes of Interaction Analysis of exercise equipment Overview For this assignment, we conducted a preliminary analysis of two similar types of elliptical trainers. We are able to find

More information

3 CONCEPTUAL FOUNDATIONS OF STATISTICS

3 CONCEPTUAL FOUNDATIONS OF STATISTICS 3 CONCEPTUAL FOUNDATIONS OF STATISTICS In this chapter, we examine the conceptual foundations of statistics. The goal is to give you an appreciation and conceptual understanding of some basic statistical

More information

In this chapter we discuss validity issues for quantitative research and for qualitative research.

In this chapter we discuss validity issues for quantitative research and for qualitative research. Chapter 8 Validity of Research Results (Reminder: Don t forget to utilize the concept maps and study questions as you study this and the other chapters.) In this chapter we discuss validity issues for

More information

Awareness in contextual cuing with extended and concurrent explicit tests

Awareness in contextual cuing with extended and concurrent explicit tests Memory & Cognition 28, 36 (2), 43-415 doi: 1.3758/MC.36.2.43 Awareness in contextual cuing with extended and concurrent explicit tests ANDREA C. SMYTH AND DAVID R. SHANKS University College London, London,

More information

Psychology Research Process

Psychology Research Process Psychology Research Process Logical Processes Induction Observation/Association/Using Correlation Trying to assess, through observation of a large group/sample, what is associated with what? Examples:

More information

The Pretest! Pretest! Pretest! Assignment (Example 2)

The Pretest! Pretest! Pretest! Assignment (Example 2) The Pretest! Pretest! Pretest! Assignment (Example 2) May 19, 2003 1 Statement of Purpose and Description of Pretest Procedure When one designs a Math 10 exam one hopes to measure whether a student s ability

More information

Decision Rules for Recognition Memory Confidence Judgments

Decision Rules for Recognition Memory Confidence Judgments Journal of Experimental Psychology: Learning, Memory, and Cognition 1998, Vol. 24, No. 6, 1397-1410 Copyright 1998 by the American Psychological Association, Inc. 0278-7393/98/S3.00 Decision Rules for

More information

Editorial Note: this manuscript has been previously reviewed at another journal that is not operating a transparent peer review scheme.

Editorial Note: this manuscript has been previously reviewed at another journal that is not operating a transparent peer review scheme. Editorial Note: this manuscript has been previously reviewed at another journal that is not operating a transparent peer review scheme. This document only contains reviewer comments and rebuttal letters

More information

Statistics and Probability

Statistics and Probability Statistics and a single count or measurement variable. S.ID.1: Represent data with plots on the real number line (dot plots, histograms, and box plots). S.ID.2: Use statistics appropriate to the shape

More information

SLAUGHTER PIG MARKETING MANAGEMENT: UTILIZATION OF HIGHLY BIASED HERD SPECIFIC DATA. Henrik Kure

SLAUGHTER PIG MARKETING MANAGEMENT: UTILIZATION OF HIGHLY BIASED HERD SPECIFIC DATA. Henrik Kure SLAUGHTER PIG MARKETING MANAGEMENT: UTILIZATION OF HIGHLY BIASED HERD SPECIFIC DATA Henrik Kure Dina, The Royal Veterinary and Agricuural University Bülowsvej 48 DK 1870 Frederiksberg C. kure@dina.kvl.dk

More information

Are Retrievals from Long-Term Memory Interruptible?

Are Retrievals from Long-Term Memory Interruptible? Are Retrievals from Long-Term Memory Interruptible? Michael D. Byrne byrne@acm.org Department of Psychology Rice University Houston, TX 77251 Abstract Many simple performance parameters about human memory

More information

7 Grip aperture and target shape

7 Grip aperture and target shape 7 Grip aperture and target shape Based on: Verheij R, Brenner E, Smeets JBJ. The influence of target object shape on maximum grip aperture in human grasping movements. Exp Brain Res, In revision 103 Introduction

More information

Choose an approach for your research problem

Choose an approach for your research problem Choose an approach for your research problem This course is about doing empirical research with experiments, so your general approach to research has already been chosen by your professor. It s important

More information

Empirical Formula for Creating Error Bars for the Method of Paired Comparison

Empirical Formula for Creating Error Bars for the Method of Paired Comparison Empirical Formula for Creating Error Bars for the Method of Paired Comparison Ethan D. Montag Rochester Institute of Technology Munsell Color Science Laboratory Chester F. Carlson Center for Imaging Science

More information

Chapter Three: Sampling Methods

Chapter Three: Sampling Methods Chapter Three: Sampling Methods The idea of this chapter is to make sure that you address sampling issues - even though you may be conducting an action research project and your sample is "defined" by

More information

CS/NEUR125 Brains, Minds, and Machines. Due: Friday, April 14

CS/NEUR125 Brains, Minds, and Machines. Due: Friday, April 14 CS/NEUR125 Brains, Minds, and Machines Assignment 5: Neural mechanisms of object-based attention Due: Friday, April 14 This Assignment is a guided reading of the 2014 paper, Neural Mechanisms of Object-Based

More information

Dynamic Integration of Reward and Stimulus Information in Perceptual Decision-Making

Dynamic Integration of Reward and Stimulus Information in Perceptual Decision-Making Integration of reward and stimulus information Dynamic Integration of Reward and Stimulus Information in Perceptual Decision-Making Juan Gao,2, Rebecca Tortell James L. McClelland,2,, Department of Psychology,

More information

A BAYESIAN SOLUTION FOR THE LAW OF CATEGORICAL JUDGMENT WITH CATEGORY BOUNDARY VARIABILITY AND EXAMINATION OF ROBUSTNESS TO MODEL VIOLATIONS

A BAYESIAN SOLUTION FOR THE LAW OF CATEGORICAL JUDGMENT WITH CATEGORY BOUNDARY VARIABILITY AND EXAMINATION OF ROBUSTNESS TO MODEL VIOLATIONS A BAYESIAN SOLUTION FOR THE LAW OF CATEGORICAL JUDGMENT WITH CATEGORY BOUNDARY VARIABILITY AND EXAMINATION OF ROBUSTNESS TO MODEL VIOLATIONS A Thesis Presented to The Academic Faculty by David R. King

More information

Model Evaluation using Grouped or Individual Data. Andrew L. Cohen. University of Massachusetts, Amherst. Adam N. Sanborn and Richard M.

Model Evaluation using Grouped or Individual Data. Andrew L. Cohen. University of Massachusetts, Amherst. Adam N. Sanborn and Richard M. Model Evaluation: R306 1 Running head: Model Evaluation Model Evaluation using Grouped or Individual Data Andrew L. Cohen University of Massachusetts, Amherst Adam N. Sanborn and Richard M. Shiffrin Indiana

More information

Convergence Principles: Information in the Answer

Convergence Principles: Information in the Answer Convergence Principles: Information in the Answer Sets of Some Multiple-Choice Intelligence Tests A. P. White and J. E. Zammarelli University of Durham It is hypothesized that some common multiplechoice

More information

Encoding of Elements and Relations of Object Arrangements by Young Children

Encoding of Elements and Relations of Object Arrangements by Young Children Encoding of Elements and Relations of Object Arrangements by Young Children Leslee J. Martin (martin.1103@osu.edu) Department of Psychology & Center for Cognitive Science Ohio State University 216 Lazenby

More information

Appendix: Instructions for Treatment Index B (Human Opponents, With Recommendations)

Appendix: Instructions for Treatment Index B (Human Opponents, With Recommendations) Appendix: Instructions for Treatment Index B (Human Opponents, With Recommendations) This is an experiment in the economics of strategic decision making. Various agencies have provided funds for this research.

More information

Making Inferences from Experiments

Making Inferences from Experiments 11.6 Making Inferences from Experiments Essential Question How can you test a hypothesis about an experiment? Resampling Data Yield (kilograms) Control Group Treatment Group 1. 1.1 1.2 1. 1.5 1.4.9 1.2

More information

2012 Course: The Statistician Brain: the Bayesian Revolution in Cognitive Sciences

2012 Course: The Statistician Brain: the Bayesian Revolution in Cognitive Sciences 2012 Course: The Statistician Brain: the Bayesian Revolution in Cognitive Sciences Stanislas Dehaene Chair of Experimental Cognitive Psychology Lecture n 5 Bayesian Decision-Making Lecture material translated

More information

An Ideal Observer Model of Visual Short-Term Memory Predicts Human Capacity Precision Tradeoffs

An Ideal Observer Model of Visual Short-Term Memory Predicts Human Capacity Precision Tradeoffs An Ideal Observer Model of Visual Short-Term Memory Predicts Human Capacity Precision Tradeoffs Chris R. Sims Robert A. Jacobs David C. Knill (csims@cvs.rochester.edu) (robbie@bcs.rochester.edu) (knill@cvs.rochester.edu)

More information