Hypothesis Testing: Strategy Selection for Generalising versus Limiting Hypotheses

Thinking and Reasoning, 1999, 5 (1), 67

Barbara A. Spellman, University of Virginia, USA
Alejandro López, University of Hamburg, Germany
Edward E. Smith, University of Michigan, USA

Humans appear to follow normative rules of inductive reasoning in premise diversity tasks; that is, they know that dissimilar rather than similar evidence is better for generalising hypotheses. In three experiments, we use a hypothesis limitation task to compare a related inductive reasoning skill: knowing how to limit hypotheses by using a negative test strategy. Participants are told that one category member has some property (e.g. Dogs have a merocrine gland) and are asked what evidence they would test to ensure that either all (generalisation) or only (limitation) category members have that property (e.g. All/Only mammals have merocrine glands; tests: wolf, bull, crocodile). Despite participants' reluctance to use negative tests in the Wason task and other reasoning tasks, participants do use normatively correct negative tests in the hypothesis limitation task as often as they use diverse positive tests in the premise diversity task. Moreover, when given a hypothesis limitation task before a rule evaluation task (similar to the 2-4-6 task), the use of negative tests increases. Thus, when testing hypotheses, people can and do use the right kind of test strategy for the task.

Requests for reprints should be sent to: Barbara A. Spellman, Dept. of Psychology, University of Virginia, 102 Gilmer Hall, Charlottesville, VA 22903, USA. Email: spellman@virginia.edu

We would like to thank Lisa Stanton for assisting with data collection and analysis, and Patricia West for helpful discussions. Michael Doherty and Michael Gorman provided helpful comments on an earlier draft.
Portions of this paper were presented as a paper at the Sixth Annual Meeting of the Southwest Cognition Conference (ARMADILLO), College Station, TX, May 1995; as a poster at the Third International Conference on Thinking, London, UK, August 1996; and as a paper at the 10th Annual Duck Conference on Social Cognition, Pine Island, NC, June Psychology Press Ltd

INTRODUCTION

During most of the last 30 years, it has been fashionable among psychologists to show that human reasoning skills do not measure up to the normative models of rationality proposed by economists, statisticians, and philosophers (see Kahneman, Slovic, & Tversky, 1982, for a collection of early papers and Nisbett & Ross, 1980, for the classic review). However, one area in which human reasoning does seem to live up to the philosophical ideal is in respecting the diversity principle (see Osherson et al., 1990), the notion that hypotheses are considered to be stronger when supported by more diverse rather than more similar evidence (Carnap, 1950; Hempel, 1966; Nagel, 1939; Popper, 1962). In this article, we extend the investigation from how people use evidence to support or generalise their hypotheses, to another reasoning skill involving the relation between evidence and hypotheses: how people use evidence to narrow or limit their hypotheses. We believe that investigating this latter skill is relevant to comprehending performance in some of the so-called scientific reasoning tasks, including the Wason (1960) rule-discovery (2-4-6) task.

Premise Diversity in Generalisation

People demonstrate respect for the diversity principle in several different inductive reasoning tasks. One is the evaluation of category-based inductive arguments. Osherson et al. (1990) had participants read two pairs of premises about different category members followed by the same conclusion about all category members. The participants' task was to evaluate which was the better argument. For example, participants might have to choose between the following arguments:

    Lions use norepinephrine as a neurotransmitter.
    Tigers use norepinephrine as a neurotransmitter.
    All mammals use norepinephrine as a neurotransmitter.

and

    Lions use norepinephrine as a neurotransmitter.
    Giraffes use norepinephrine as a neurotransmitter.
    All mammals use norepinephrine as a neurotransmitter.

According to the diversity principle, the second argument is better because the conclusion is supported by more diverse evidence. About 75% of participants picked the (normatively correct) second argument.

Another task in which participants demonstrate respect for the diversity principle is in the selection of evidence to support an argument (or hypothesis). In López's (1995) task, participants were told a fact about a given category member

and asked what other category member they would examine in order to test the hypothesis that the fact is true of all category members. For example:

    Suppose you know for a fact that:
        Dogs have a merocrine gland.
    What additional mammal would you examine to test whether:
        All mammals have a merocrine gland?
    Choose one of the following:
        (a) wolf
        (b) bull

According to the diversity principle, the normative answer is bull; that is, to generalise from dogs to all mammals one should choose to obtain information about other mammals that are as different from dogs as possible. López found that a majority of participants (62%) chose the normative answer in this task, and that the less similar a choice was to the initial category member, the more likely it was to be chosen. Thus, in both the evaluation of arguments and in the selection of evidence to support those arguments, it seems that most participants do understand the value of diverse evidence in generalising a hypothesis.

The Hypothesis-limitation Task

The goal of a reasoner, however, may be to limit a hypothesis, not to generalise it. To illustrate using the previous (dog) example, whereas someone might want to know whether all mammals had merocrine glands, someone else might want to know whether only mammals had merocrine glands. In this case, imagine the possible answers are: (a) wolf, (b) bull, and (c) crocodile. If we knew that dogs had merocrine glands and wanted to know whether only mammals had them, we should test the one animal that could give us relevant evidence: the non-mammal crocodile. A major focus of this article is the issue of whether people are as good at selecting evidence for this kind of inductive reasoning (i.e. limiting their hypotheses) as they are for generalising their hypotheses.
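The contrast between generalising and limiting can be sketched in a few lines of Python. This is only an illustration of the dog example, not a procedure from the study: the function name `normative_test`, the data layout, and the similarity scores are hypothetical placeholders.

```python
# Illustrative sketch only. The similarity scores below are made-up
# placeholders, not data from Lopez (1995) or from the present experiments.

def normative_test(question, candidates):
    """Return the normatively best test item for an 'all' or 'only' question.

    candidates maps each animal to (is_category_member, similarity_to_known_member).
    """
    if question == "all":
        # Generalising: test the category member least similar to the known one.
        members = {name: sim for name, (inside, sim) in candidates.items() if inside}
        return min(members, key=members.get)
    if question == "only":
        # Limiting: only a non-member can reveal the property outside the category.
        # With a single non-member, as here, the choice is unambiguous; taking the
        # most similar non-member is an assumption of this sketch.
        outsiders = {name: sim for name, (inside, sim) in candidates.items() if not inside}
        return max(outsiders, key=outsiders.get)
    raise ValueError("question must be 'all' or 'only'")


# Known fact: dogs have a merocrine gland. Category under test: mammals.
candidates = {
    "wolf": (True, 0.9),        # mammal, similar to dog
    "bull": (True, 0.4),        # mammal, dissimilar to dog
    "crocodile": (False, 0.1),  # non-mammal
}

all_choice = normative_test("all", candidates)
only_choice = normative_test("only", candidates)
```

Under these assumed scores the "all" question selects bull and the "only" question selects crocodile, matching the normative answers given in the text.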
The Wason Rule-discovery (2-4-6) Task

Previous research, particularly using the Wason (1960) rule-discovery (2-4-6) task, might suggest that participants would be bad at the hypothesis-limitation task because it involves the use of negative tests (as described by Klayman & Ha, 1987). The Wason task is a widely used experimental paradigm, the results of which were initially interpreted as showing that (untrained) human scientific reasoning is flawed because people try to confirm rather than disconfirm their hypotheses. In the Wason (1960) task, participants are told that the experimenter has in mind a simple rule describing a particular relation between any three numbers,

and that the triad 2-4-6 conforms to this rule. The participants' task is to discover the experimenter's rule by proposing other test triads (sometimes along with the reasons for proposing them). For each test triad, participants are told whether or not it conforms to the rule. Participants are asked to announce the rule once they are convinced of having discovered it. They are then told whether or not the announced rule is correct, and, if it is wrong, are allowed to continue proposing triads, reasons, and rules. In the canonical version of the task, the simple rule that the experimenter has in mind is "any ascending triad". When conducting experiments using this task, Wason (1960, 1962, 1968a; Wason & Johnson-Laird, 1972) and many others (e.g. Farris & Revlin, 1989; Gorman, Stafford, & Gorman, 1987; Klayman & Ha, 1989; Mahoney & DeMonbreun, 1977; Tweney et al., 1980) find that most participants form an initial narrow hypothesis such as "the rule is triads ascending by two", and then test it by proposing other triads conforming to that hypothesised rule (e.g ). Because those triads also happen to conform to the experimenter's actual rule, the participants obtain confirming evidence for their hypothesised rule, and then often make an incorrect rule announcement. Most participants (at least initially, before the first rule announcement) fail to test their hypothesised rule by proposing triads not conforming to it (e.g ). Given that many triads that fall outside the hypothesised rule also happen to conform to the actual rule, the participants would obtain disconfirming evidence for their too-narrow hypothesised rule, and, presumably, revise and expand it, and not make an incorrect rule announcement. Initially this tendency to test triads that conform to one's own hypothesised rule was referred to as confirmation bias (see Evans & Over, 1996, and Klayman, 1995, for reviews).
The suspicion was that participants wanted to find data that confirmed their own initial hypotheses and this bias was pervasive in their generation of test items. The normative method of scientific inquiry against which this tendency was evaluated was the method of falsification proposed by the philosopher Popper (1959) and adopted by the psychologists who were evaluating human hypothesis-testing performance (e.g. Wason, 1960).¹ According to Popper, in order to best test one's hypothesis, one should attempt to falsify it. If the resulting data then confirm the hypothesis, those are more valuable data than data obtained from experiments that were merely trying to confirm the hypothesis. The results from many experiments were therefore interpreted as showing that humans were non-normative reasoners because they failed to try to disconfirm their hypotheses and merely tried to confirm them.

¹ Popper (1959) believed that falsification both describes how science actually proceeds and is the way science ought to proceed. Many philosophers of science disagree with each of those ideas. For example, Kuhn (1970) argues that science proceeds through stages of slow accretion and rapid revolution. Quine (1961; Duhem, 1954/1906) and others believe that falsification is not merely non-normative, but that it is essentially impossible.
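The logic of the canonical task described above can be checked by brute-force enumeration. The following sketch is an illustration under stated assumptions, not part of the original study: the function names and the restriction to triads of integers from 1 to 10 are arbitrary choices made here.

```python
from itertools import product


def ascending(triad):
    """Experimenter's canonical rule: any ascending triad."""
    a, b, c = triad
    return a < b < c


def ascending_by_two(triad):
    """A typical narrow initial hypothesis: triads ascending by two."""
    a, b, c = triad
    return b - a == 2 and c - b == 2


def outcomes(hypothesis, true_rule, triads):
    """Count tests and falsifying outcomes for positive and negative tests.

    A positive test is a triad the hypothesis says conforms; a negative test
    is one it says does not. A test falsifies the hypothesis whenever the
    hypothesis and the true rule disagree about the triad.
    """
    counts = {"positive": [0, 0], "negative": [0, 0]}  # [tests, falsifications]
    for triad in triads:
        predicted = hypothesis(triad)
        kind = "positive" if predicted else "negative"
        counts[kind][0] += 1
        if predicted != true_rule(triad):
            counts[kind][1] += 1
    return counts


# Assumption of this sketch: test triads are drawn from the integers 1..10.
triads = list(product(range(1, 11), repeat=3))
result = outcomes(ascending_by_two, ascending, triads)
```

Over these 1,000 triads, none of the 6 positive tests can falsify the narrow hypothesis, whereas 114 of the 994 negative tests would. Swapping the two rule arguments shows the reverse pattern when the hypothesis is broader than the actual rule.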

Klayman and Ha (1987) interpret these results in a different manner. Participants, they argue, are not necessarily trying to confirm their hypotheses; it just looks that way in this particular version of the rule-discovery task because the hypothesised rule usually is a subset of the experimenter's rule and therefore all triads that conform to the former also conform to the latter. However, given other relationships between the two rules, proposing triads that conform to the hypothesised rule could, in fact, disconfirm the hypothesised rule. For example, if the hypothesised rule is "triads ascending by two", but the experimenter's actual rule is the narrower rule of "even triads ascending by two", the proposed triad would conform to the hypothesised rule but would also disconfirm it. Thus, although the participants' strategy in theory could lead to finding disconfirming cases, they are just not getting any disconfirming cases in the canonical version of the task. Klayman and Ha (1987) argue that a better way of interpreting what participants are doing is that they are using a positive test strategy, in which cases conforming to a given hypothesis are tested, rather than a negative test strategy, in which cases not conforming to a given hypothesis are tested. Note that either strategy could lead to disconfirmation depending on the relation between the hypothesised and actual rules. The numerous results from the 2-4-6 task can be interpreted in light of this strategy distinction. (See Klayman & Ha, 1987, for a description of the use of the positive test strategy in other reasoning tasks.)

Task Differences

The fact that participants often fail to use negative tests in the Wason task suggests that they might also fail to use negative tests (e.g. fail to pick "crocodile") in a hypothesis-limitation task.
However, there are several differences between the Wason rule-discovery task and our task that might lead to differing performance in these tasks. First, the original 2-4-6 task is a rule-discovery task and entails a dual generation process of producing both a plausible hypothesis (rule) and appropriate tests (triads) (see Klahr & Dunbar's, 1988, "scientific discovery as dual search" model of scientific reasoning). In López's (1995) premise diversity task and our hypothesis-limitation task, the hypothesis/rule is given (e.g. all mammals have a merocrine gland) and appropriate tests are provided (e.g. wolf, bull, crocodile); the participants need only evaluate which of those given tests is best for assessing the given hypothesis. Second, successful rule discovery demands more evidence than either successful generalisation or limitation, in that for rule discovery every exemplar is relevant to assessing the rule. In contrast, if one merely wants to discover whether all mammals have a merocrine gland, at worst one needs to test all mammals. Whether or not other types of animals (e.g. fish, insects, birds) also have a merocrine gland is irrelevant. If one merely wants to discover whether

only mammals have a merocrine gland, at worst one needs to test all non-mammals. Whether every different type of mammal (e.g. whales, bats, kangaroos) also has a merocrine gland is irrelevant. However, if one wants to discover the rule about which animals have a merocrine gland, one needs to test not only all mammals (to make sure that they do) but also all non-mammals (to make sure that they do not). Thus in the original 2-4-6 task, testing the proposed hypothesis that the rule is "triads ascending by two" involves discovering not only that all triads ascending by two conform to the rule (promoting a positive test strategy) but also that only triads ascending by two conform to the rule (promoting a negative test strategy). Thus, both kinds of test and every exemplar are relevant to the rule-discovery task.

Wason (1960) was quite aware of the two-sided nature of the rule-discovery task. What astonished him was that intelligent young adults would only be concerned with verifying that, in fact, all the exemplars conforming to their hypothesis did conform to the experimenter's rule (i.e. they made positive tests), but did not attempt to verify that only the exemplars conforming to their hypothesis did conform to the experimenter's rule (i.e. they did not make negative tests).

Our experiments were intended to evaluate participants' intentional use of negative tests to limit hypotheses and to compare that to their ability to use diverse positive tests to generalise hypotheses. Because the choice of test is relevant to performance in Wason's task, we also included a rule-like condition in which both positive and negative tests would be necessary for assessing a hypothesis.

GENERAL METHOD

The three experiments presented in this article are somewhat similar and so the basic research strategy is described here first. Differences are summarised in Table 1. Participants read a statement about evidence for an unknown rule (i.e. hypothesis).
Depending on the condition (All, Only, or Rule), participants are then asked a particular question about what other evidence they would like to obtain to assess the rule. For example:

    Suppose you know for a fact that:
        The ordered triad of numbers conforms to an unknown rule about number triads called alpha.
    What triad would you choose to test whether or not:
        (All condition) All triads ascending by two conform to the alpha rule? or
        (Only condition) Only triads ascending by two conform to the alpha rule? or
        (Rule condition) The alpha rule is triads ascending by two?

TABLE 1
Differences Between Experiments

Experiment   Content                              Questions                   Task
1            Animals & Numbers (within-subject)   All, Only, Rule             Evaluation
2            Numbers                              All-Rule, Only-Rule, Rule   Evaluation
3            Numbers                              All-Rule, Only-Rule, Rule   Generation

The All condition is designed to be analogous to López's (1995) premise diversity task. Participants know the hypothesis in question, are given a piece of supporting evidence, and need to obtain other evidence to determine whether the hypothesis generalises to all triads ascending by two (other kinds of triads are irrelevant). The normative test strategy is a positive test strategy using a diverse test triad.

The Only condition is designed to reveal whether participants will use a negative test strategy when it is the single normatively correct strategy. Again, participants know the hypothesis in question and are given a piece of supporting evidence, but in this condition they need to determine whether the hypothesis is limited to only triads ascending by two, by ruling out other kinds of triads. The normative test strategy is therefore a negative test strategy; that is, testing a triad that does not ascend by two to make sure it does not conform to the rule. By comparing performance in the All and Only conditions, we can compare participants' understanding of the value of negative tests for limiting hypotheses to their understanding of the value of diverse positive tests for generalising hypotheses.

The Rule condition is meant to capture some of the difficulty of the Wason task. Of course, it is not a rule-discovery task because participants are provided with the rule to evaluate and do not have to propose it for themselves. In addition, in the present task participants get to make just one test whereas in the 2-4-6 task participants know they can make many tests.
However, in the Rule condition, as in the original 2-4-6 task, two kinds of evidence are relevant to verifying that a rule is correct: using positive tests to make sure that all triads ascending by two do conform and using negative tests to make sure that all triads not ascending by two do not conform. By comparing performance in the Rule and Only conditions, we can determine whether participants understand the value of

negative tests but merely choose not to use them, at least as a first test, when both kinds of tests are required.

EXPERIMENT 1

In Experiment 1, as just described, participants were given a piece of evidence (e.g ) and a hypothesised rule (e.g. "the rule is triads ascending by two"), then, depending on condition (All, Only, or Rule), asked a question about what further evidence they would like to obtain. Participants were also provided with three test triads from which to choose (e.g , , 1-2-3). Of these three test triads, one was a similar positive test (i.e ), one was a dissimilar (or diverse) positive test (i.e ), and one was a negative test (i.e ) of the hypothesised rule.

In this experiment, we thought that providing the test triads (in addition to providing the hypothesised rule) might lead participants into using negative tests in the Rule condition. We are not the first researchers to use this tactic: Kareev and Halberstadt (1993) also modified the original 2-4-6 task by providing test triads; they demonstrated that when given test choices people sometimes do appreciate the benefits of negative testing. In two experiments, they had participants assess the protocol of an imaginary problem-solver working on the 2-4-6 task. In their Experiment 1, participants were provided with an example triad (e.g ), the problem-solver's hypothesised rule (e.g. "the rule is triads where the second number is twice the first number, and the third number is three times the first number"), and four possible test triads including two positive tests (e.g ) and two negative tests (e.g ). The participants' task was to evaluate how useful each test triad would be for discovering the rule. Overall, participants judged positive tests to be more useful than negative tests for discovering the rule. Experiment 2 was similar, but this time the participants were also provided with the experimenter's rule (e.g. "the rule is triads of two-digit numbers").
For each of the two positive and two negative tests, one would falsify and the other would confirm the problem-solver's hypothesised rule. In this case, participants judged the negative tests that falsified the hypothesised rule to be the most valuable (with negative tests that confirmed the hypothesised rule to be the least valuable). Thus, people may appreciate the benefits of negative testing at least when they know the experimenter's actual rule and so know what the outcome of the test would be. In addition, Kareev and Avrahami (1995) have shown that people appreciate the benefits of negative tests and falsifying exemplars when creating a list of exemplars to teach someone else a rule. However, evaluating the worth of a particular test when you know the outcome of it (i.e. whether it is confirming or falsifying) may be quite different from evaluating the worth of a particular kind of test independent of the outcome. It is the latter that is of interest in most of this literature, and in this article.

In addition to the differences between the present task and the canonical 2-4-6 task mentioned earlier, there is one other notable difference between the way the premise diversity tasks and the 2-4-6 task have been run in the past: the former have used familiar members of natural kind categories (e.g. mammals, fruits) whereas the latter has used numbers. Although it may seem odd to suggest that a mere difference in content could affect performance, there are many examples of formally equivalent reasoning tasks for which content does matter (most notably Wason's, 1966, 1968b, other famous task, the Wason selection task; see Holyoak & Spellman, 1993, for a review). Thus, in Experiment 1 we used both animals and numbers as content.

Method

Participants and Procedure

A total of 144 University of Texas undergraduates participated in partial fulfilment of a course requirement. (The data from one additional participant were discarded because two answers were circled in response to one question.) Participants were tested in small groups of varying sizes. Each participant received a booklet containing the problems of interest along with other short reasoning problems. Participants were encouraged to take as much time as they needed to answer all questions.

Design

There were two independent variables: condition and content. Condition was tested between subjects; each participant was given either All, Only, or Rule questions. Content was tested within subject; each participant received two animal problems and two number problems. Half did the animal problems first, half did the number problems first. Which particular version of the problem went first was also counterbalanced across participants. In each booklet, the first two similar-content problems were followed by a short unrelated (filler) reasoning task, then by the other two similar-content problems.
Materials

The materials were adapted from López's (1995) premise diversity task; examples are presented here. (The full set of materials is shown in Appendix A.) Each version was on a separate page. Participants were told a piece of evidence and then asked what further evidence they would like to obtain to assess a particular hypothesis. In each condition participants were given one hypothesis to test for each item. (Unlike the examples shown here, participants saw only one question per item and did not see any of the italicised condition or test labels.) They were then told to circle one

of three possible answers, which in fact corresponded to: a positive test of the hypothesis that used evidence similar to that initially provided; a positive test of the hypothesis that used evidence dissimilar from that initially provided; or a negative test of the hypothesis.² For each version of each problem, the order of choices of answers was counterbalanced across participants.

Animal Versions:

    Suppose you know for a fact that:
        Dogs have a merocrine gland.
    What organism would you examine to test whether or not:
        (All condition) All mammals have a merocrine gland? or
        (Only condition) Only mammals have a merocrine gland? or
        (Rule condition) A merocrine gland is a mammal property?
    Choose one of the following organisms (circle the answer):
        (a) wolf (positive test, similar animal)
        (b) bull (positive test, dissimilar animal)
        (c) crocodile (negative test)

In the other animal version, the known fact was that hippopotamuses have an ulnar artery, the questions were about mammals having ulnar arteries, and the answer choices were: rhinoceros (similar), hamster (dissimilar), and robin (negative).

Number Versions:

    Suppose you know for a fact that:
        The ordered triad of numbers conforms to an unknown rule about number triads called alpha.
    What triad would you examine to test whether or not:
        (All condition) All triads ascending by two conform to the alpha rule? or
        (Only condition) Only triads ascending by two conform to the alpha rule? or
        (Rule condition) The alpha rule is triads ascending by two?
    Choose one of the following triads (circle the answer):
        (a) (positive test, similar triad)
        (b) (positive test, dissimilar triad)
        (c) (negative test)

In the other number version, the known fact was that the ordered triad of numbers conforms to an unknown rule about number triads called beta, the questions were about triads descending by four, and the answer choices were: (similar), (dissimilar), and (negative).

² We did not give participants the choice between similar and dissimilar negative tests. It seems that whereas a dissimilar positive test is the normative test for the All question, a similar negative test would be the normative test for the Only condition.

Results and Discussion

Scoring Using Consistent Strategies

For each content, we determined which participants selected the same test strategy for the two versions (i.e. the two animal problems or the two number problems). These consistent participants were scored as similar if they picked the positive test/similar evidence answer for both versions; dissimilar if they picked the positive test/diverse evidence answer for both versions; and negative if they picked the negative test for both versions. We used this procedure to reduce the effects of guessing. The number of consistent participants in each condition is presented in the leftmost numerical column of Table 2. Overall, 81% of the time participants used a consistent strategy; the percentage did not vary much across conditions (77% to 85%).

TABLE 2
Experiment 1: Percentage of Consistent Participants Using Each Strategy in Each Condition

Condition    N    Positive test (similar)    Positive test (dissimilar)    Negative test
Number
  All
  Only
  Rule
Animal
  All
  Only
  Rule

N = the number of participants out of 48 using a consistent strategy. Bold indicates normative test strategy in the All and Only conditions. For the number problems, based on actual frequencies, χ²(4, N = 118) = 36.79, P <
For the animal problems, based on actual frequencies, χ²(4, N = 115) = 37.13, P < .001.

Performance by Condition

In the All condition, most of the consistent participants selected the normative answer: positive test/dissimilar evidence. The percentages were nearly equal in the number and animal problems, suggesting that many participants understand

the importance of premise diversity when generalising a hypothesis in either domain. (Of course, they were the same participants in both domains.)

In the Only condition, most of the consistent participants selected the normative answer: negative tests. Again, the percentages were similar in the number and animal problems, suggesting that participants understand the importance of negative tests when limiting a hypothesis in either domain. More importantly, the proportion of consistent participants using negative tests in the Only condition was not significantly different from the proportion using dissimilar tests in the All condition [for number, χ²(1, N = 78) = 0.32, ns; for animal, χ²(1, N = 76) = 1.39, ns]. Thus, many participants will use negative tests when they are the single normatively correct answer.

The Rule condition shows the most strategy variability: no strategy was used by a majority of the participants. (Although this variability may seem odd given the usual homogeneity described previously, the present heterogeneity may be attributable to the present method: here participants selected which of the given evidence to test, whereas usually participants must generate their own evidence to test.) For both contents, about 40% of the consistent participants selected negative tests; in the number problems most of the others selected similar tests whereas in the animal problems most of the others selected dissimilar tests. The proportion of consistent participants using negative tests in the Rule condition is significantly less than the proportion using negative tests in the Only condition [for number, χ²(1, N = 77) = 4.68, P < .05; for animal, χ²(1, N = 78) = 12.04, P < .001].
Therefore, the lack of negative tests in the Rule condition is not due to a failure to appreciate the evidentiary value of such tests (as shown by the results in the Only condition); rather, the variety in strategy selection in the Rule condition is our first piece of evidence that participants (correctly) interpret the Rule condition as requiring both kinds of tests.

Content Effects

Performance on the animal and number problems is not actually as similar as it appears to be in Table 2. When analysed by content order, the Rule condition shows some asymmetric transfer effects. Although interesting, these effects are not relevant to the current analysis; they are documented and described in Appendix B.

EXPERIMENT 2

Experiment 2 had two main goals. First, we wanted to replicate the results of Experiment 1 using more participants. Because of the asymmetric transfer effects, we really only had uncontaminated data from 24 participants in each condition. Therefore, in Experiment 2 we used just the number problems.

Second, and more importantly, we wanted to further investigate the variability of answers in the Rule condition. As we have argued, when discovering or evaluating a rule, one needs to find both all of the items included by the rule and all excluded by the rule. The former suggests a positive test strategy (preferably one using dissimilar tests) whereas the latter suggests a negative test strategy. Yet in the number problems, the plurality of the consistent participants in the Rule condition used a similar test (that is, did not follow either optimal strategy) even though many participants showed that they understood the value of dissimilar tests (in the All condition) and negative tests (in the Only condition). In Experiment 2 we thought we could increase the participants' use of optimal tests in the Rule condition by priming them to realise the value of one of those (i.e. either dissimilar or negative) tests.

Various attempts have been made to try to foster the use of negative tests on the original 2-4-6 task. Because participants who propose more negative tests are more likely to discover the rule (e.g. Wason, 1960), researchers thought that fostering the use of negative tests would facilitate performance. In one line of research (e.g. Gorman et al., 1987; Tweney et al., 1980; Wharton, Cheng, & Wickens, 1993), participants are told that they have to guess two rules that the experimenter has in mind (the target rule being the 2-4-6 rule). When participants are informed that the two rules are complementary (i.e. that all triads conform to exactly one of the rules), the number of correct first rule announcements doubles (to 43% in Wharton et al., 1993). Why? When participants make positive tests of one of the rules, they are making negative tests of the complementary rule. Those tests give them information that will allow them to disconfirm overly narrow hypotheses.
A second line of research found that under some conditions, explicitly instructing participants to use negative tests will greatly increase the use of such tests and the chances of success at the task (Gorman & Gorman, 1984; see Evans, 1989, for a review). In a third line of research, more similar to the present, Kareev, Halberstadt, and Shafir (1993) found that participants who passively see negative test triads provided by the experimenters during a practice problem will use negative tests more often on a subsequent problem than participants who have to generate their own triads during the practice problem. In Experiment 2 we wanted to find out whether we could foster the use of dissimilar and negative tests by having participants actively select such tests on an initial task (All or Only, respectively); participants might then transfer the selected strategy to the Rule task. Accordingly, in Experiment 2 one-third of the participants answered All questions before answering Rule questions; one-third answered Only questions before answering Rule questions; one-third answered just Rule questions. If the initial questions reveal the value of the test strategy for the Rule condition, then participants who answer All questions first should be more likely to use dissimilar tests when later answering Rule questions, whereas

participants who answer Only questions first should be more likely to use negative tests when later answering Rule questions.

Method

Participants and Procedure

A total of 144 University of Texas undergraduates participated in partial fulfilment of a course requirement. (The data from five additional participants were discarded: four circled two answers in response to a single question; one claimed to not understand the experiment.) None had participated in the previous experiment. The procedure was the same as in Experiment 1.

Design

There were three between-subjects conditions: Rule, All-Rule, and Only-Rule. (Content was dropped as a variable.) In the Rule condition, participants answered just the Rule questions for both versions of the number problem. In the All-Rule and Only-Rule conditions, participants answered the All (or Only) questions for both versions, then did a short unrelated (filler) reasoning task, then answered the Rule questions for both versions.³ Half of the participants did one version (or set of versions) first; the other half did the other first.

Materials

We used the two versions of the number problem from Experiment 1. The only change we made was that for the All and Only questions we put the words ALL and ONLY in bold capital letters so that participants in the All-Rule and Only-Rule conditions, respectively, would be sure to realise that those first questions were different from the following Rule questions.

Results and Discussion

Overall, 75% of the time participants used a consistent strategy. For the questions they answered first, 70% of the participants were consistent; for the Rule questions when answered second, 81% of the participants were consistent. (The number of consistent participants is given in the leftmost numerical column of Table 3.)
Participants who answered the Rule questions second were marginally more likely to be consistent than participants who answered the Rule questions first, χ²(1, N = 144) = 3.77.⁴

³ Although it appears we are confounding condition and practice in that we are comparing Rule questions answered first (in the Rule condition) to Rule questions answered second (in the All-Rule and Only-Rule conditions), we are mostly interested in the difference between the latter conditions.

⁴ Note that there is no increase in consistency when participants answered Rule questions following other Rule questions in Experiment 1 (see the bottom of Table 5).

TABLE 3
Experiment 2 (Evaluation): Percentage of Consistent Participants Using Each Strategy in Each Condition

Columns: Condition; N; Positive test (similar); Positive test (dissimilar); Negative test
Rows, Questions Answered First: All; Only; Rule
Rows, Rule Questions When Second: All-Rule; Only-Rule

Note. N = the number of participants out of 48 using a consistent strategy. Bold indicates normative test strategy in the All and Only conditions. For the top three conditions, based on actual frequencies, χ²(4, N = 101) = 50.45, P < .001. For the bottom three conditions, based on actual frequencies, χ²(4, N = 110) = 37.53, P < .001; for the bottom two conditions, χ²(2, N = 78) = 6.88, P < .05.

Questions Answered First

The top of Table 3 contains the results for the questions answered first (All, Only, or Rule). These results amount to a replication of the All, Only, and Rule conditions of Experiment 1 when the number problems were done first (see Appendix B). Overall, participants showed different patterns of answers across the three types of questions (see Note to Table 3). As in Experiment 1, for the All questions, most of the consistent participants (54%) again selected the normative answer of the dissimilar positive test. For the Only questions, most of the consistent participants (68%) again selected the normative answer of the negative test. Again, the proportion using negative tests for the Only questions was not significantly different from the proportion using dissimilar tests for the All questions, χ²(1, N = 69) = 1.29, ns. For the Rule questions, when answered first, most of the consistent participants (76%) selected the positive similar test.
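The sample sizes reported with the tests above are enough to recover how many consistent participants there were in each condition, and from there to check two of the reported statistics. The following is a minimal sketch of that arithmetic, not the authors' analysis. It assumes a Pearson chi-square without continuity correction, and it assumes that the cell counts can be reconstructed by rounding the reported percentages; `chi_square_2x2` and all variable names are illustrative.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Numbers of consistent participants, as reported in the text and the table note.
first_three = 101     # All + Only + Rule (questions answered first)
all_plus_only = 69    # N of the All vs. Only comparison
only_plus_rule = 66   # N of the Only vs. Rule comparison
rule_three = 110      # Rule + All-Rule + Only-Rule
second_two = 78       # All-Rule + Only-Rule
per_condition = 48    # participants per condition ("out of 48")

# Solve the small linear system for the per-condition counts.
n_rule = first_three - all_plus_only   # 32
n_only = only_plus_rule - n_rule       # 34
n_all = all_plus_only - n_only         # 35
assert n_rule == rule_three - second_two  # the two routes to the Rule count agree

# Consistency: Rule answered second (two conditions) vs. Rule answered first.
consistency = chi_square_2x2(
    second_two, 2 * per_condition - second_two,
    n_rule, per_condition - n_rule,
)

# ASSUMPTION: counts reconstructed by rounding the reported percentages.
all_dissimilar = round(0.54 * n_all)   # 19 of 35
only_negative = round(0.68 * n_only)   # 23 of 34
normative = chi_square_2x2(
    only_negative, n_only - only_negative,
    all_dissimilar, n_all - all_dissimilar,
)

print(round(consistency, 2), round(normative, 2))  # 3.77 1.29
```

Under these assumptions both values agree with the reported χ²(1, N = 144) = 3.77 and χ²(1, N = 69) = 1.29, which suggests the reconstructed counts are at least consistent with the text.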
The smaller percentage of participants selecting the negative tests (relative to Experiment 1) is concordant with the Experiment 1 results from participants who did the number problems first (see Appendix B). Again, the proportion of consistent participants using negative tests in the Rule condition is significantly less than the proportion using negative tests for the Only questions, χ²(1, N = 66) = 20.74, P < .001. Thus, even though many participants will select negative tests in response to an Only question, far fewer do so for the Rule questions.

Rule Questions When Answered Second

Participants' performance on the Rule questions depends on whether the Rule questions are answered first or answered after the All or Only questions. The pattern of strategy use for the Rule questions varied overall across the Rule, All-Rule, and Only-Rule conditions; in particular, the pattern differed between the All-Rule and Only-Rule conditions (see Note to Table 3). Answering the All questions first significantly increased the use of dissimilar tests relative to the Rule baseline condition, χ²(1, N = 71) = 13.17, P < .001. Similarly, answering the Only questions first significantly increased the use of negative tests relative to the Rule baseline condition, χ²(1, N = 71) = 11.81, P < .001. The fact that people will change strategies when answering Rule questions when other appropriate strategies are primed (see also Kareev et al., 1993) is our second piece of evidence that participants (correctly) interpret the Rule condition as requiring both kinds of tests.

Summary of Experiments 1 and 2

In Experiments 1 and 2 we have seen that people will often use negative tests to limit a hypothesis when (a) they simply have to evaluate which presented evidence is best, and (b) the test is phrased in terms of an Only question. However, when the task is to assess a rule, people do not often use negative tests even when such tests are provided to them. Instead, for the Rule questions, people use a variety of test strategies, which can be altered by prior context. To the extent that our Rule task is similar to the original task, it could be argued from our data that at least part of the failure to use a negative test strategy in the original task may be due to the Rule condition requiring both kinds of tests.
However, the failure to use a negative test strategy in the original task may also be due (in part) to the fact that participants have to generate their own test triads. We address this issue in our next experiment.

EXPERIMENT 3

The purpose of our third experiment was to determine whether strategy use would be different if participants had to generate their own evidence (as they do in the original task) rather than merely evaluate given evidence (as in Experiments 1 and 2). To examine this, we designed Experiment 3 to be identical to Experiment 2 in all respects except for its generation format: participants were asked to produce a single triad themselves in order to test a (given) hypothesised rule.

It seems that the generation task should be harder than the evaluation task. Participants might be less likely to spontaneously think of the correct test strategy than they were when the assortment of answers suggested various potential test strategies. In addition, in the absence of alternative answers (i.e. potential evidence), participants might be more strongly primed by the initial evidence and so be more likely to generate something similar to it. As a result, we would expect fewer consistent participants and fewer normatively correct participants for the All and Only questions. For the Rule questions, if it is merely the necessity of both kinds of test that makes the rule task difficult, participants should perform no differently in the generation format of Experiment 3 than they did in the evaluation format of Experiment 2. If, however, the generation format also adds to the difficulty of the Rule questions, we would surmise that it also adds to the difficulty of the original task.

Method

Participants and Procedure

A total of 144 University of Texas undergraduates participated in partial fulfilment of a course requirement. (The data from five additional participants were discarded: three skipped questions; one claimed to not understand the experiment; one wrote that the problems could not be solved by proposing only one more triad.) None had participated in either of the previous experiments. The procedure was the same as in the previous experiments.

Design

The design was identical to Experiment 2. The conditions again were: Rule, All-Rule, and Only-Rule.

Materials

The materials were identical to Experiment 2 with one change. As in Experiment 2, participants were asked what triad they would choose to test the particular question, but instead of being provided with a choice of triads they were told to generate one themselves and write it down. Three blank lines separated by dashes were supplied.
Results and Discussion

Scoring

For all questions, with respect to the hypothesised rule and the example triad, producing a test triad ascending by two constitutes a positive test of the rule, whereas producing any other test triad constitutes a negative test. We distinguished similar from dissimilar positive tests in the following way: if the triad was even and ascended by two it was considered a similar test; all other triads ascending by two were considered dissimilar tests. Although the line between similar and dissimilar seems somewhat arbitrary, it is parallel to the answers provided in Experiment 2. Thus triads such as 4, 2, 0 or 1002, 1004, 1006, which could be thought of as dissimilar, were coded as similar, whereas triads like 1, 3, 5 were coded as dissimilar. Dissimilar triads also included triads like 108.5, 110.5, and x-2, x, x+2.⁵

Consistent Participants

Overall, 67% of the time participants used a consistent strategy. For the questions they answered first, 69% of the participants were consistent; for the Rule questions when answered second, 64% of the participants were consistent. The number of consistent participants is given in the leftmost numerical column of Table 4.

TABLE 4
Experiment 3 (Generation): Percentage of Consistent Participants Using Each Strategy in Each Condition

Columns: Condition; N; Positive test (similar); Positive test (dissimilar); Negative test
Rows, Questions Answered First: All; Only; Rule
Rows, Rule Questions When Second: All-Rule; Only-Rule

Note. N = the number of participants out of 48 using a consistent strategy. Bold indicates the normative test strategy in the All and Only conditions. For the top three conditions, based on actual frequencies, χ²(4, N = 99) = 32.43, P < .001. For the bottom three conditions, based on actual frequencies, χ²(4, N = 92) = 17.99, P < .01; for the bottom two conditions, χ²(2, N = 61) = 9.91.

⁵ For the other question (beta rule), in all conditions, with respect to the hypothesised rule and the example triad, producing test triads descending by four constitutes a positive test of the rule, whereas producing any other test triad constitutes a negative test. We delineated similar from dissimilar positive tests again using odds and evens; for this rule, if the triad was odd and descended by four it was considered a similar test; all other triads descending by four were considered dissimilar tests.
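The scoring scheme described under Scoring and in footnote 5 amounts to a small decision procedure, which can be sketched as follows. This is an illustrative reading of the stated criteria, not the authors' coding protocol: the function name `score_triad` and its parameters are hypothetical, symbolic answers such as x-2, x, x+2 are not handled, and the example triads other than 1002, 1004, 1006 and 1, 3, 5 are made up for illustration.

```python
def score_triad(triad, step=2, similar_parity=0):
    """Classify a generated test triad against a hypothesised rule.

    step           -- constant difference the rule requires
                      (+2 for "ascending by two", -4 for "descending by four")
    similar_parity -- parity that makes a positive test "similar"
                      (0 = even, 1 = odd)
    """
    a, b, c = triad
    # Any triad that does not follow the hypothesised rule is a negative test.
    if b - a != step or c - b != step:
        return "negative"
    # Positive tests are "similar" only if every member is a whole number
    # with the same parity as the example triad.
    whole = all(float(x).is_integer() for x in triad)
    if whole and all(int(x) % 2 == similar_parity for x in triad):
        return "positive (similar)"
    return "positive (dissimilar)"


# Ascending-by-two rule, even triads count as similar.
print(score_triad((1002, 1004, 1006)))     # positive (similar)
print(score_triad((1, 3, 5)))              # positive (dissimilar)
print(score_triad((108.5, 110.5, 112.5)))  # positive (dissimilar); third value assumed
print(score_triad((2, 4, 8)))              # negative (hypothetical example)

# Descending-by-four rule (footnote 5), odd triads count as similar.
print(score_triad((9, 5, 1), step=-4, similar_parity=1))   # positive (similar)
print(score_triad((12, 8, 4), step=-4, similar_parity=1))  # positive (dissimilar)
```

Passing the step and the parity as parameters lets the same function cover both versions of the number problem; as the text notes, deciding what counts as a negative test is unambiguous, whereas the similar/dissimilar boundary is a coding convention.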

Questions Answered First

The top of Table 4 shows the results of Experiment 3 for the questions answered first. Overall, participants showed different patterns of answers across the three types of questions (see Note to Table 4). For the All questions, almost half of the consistent participants (47%) generated the normative answer of a dissimilar positive test. For the Only questions, just over half of the consistent participants (53%) generated the normative answer of a negative test. Again, the proportion using negative tests for the Only questions was not significantly different from the proportion using dissimilar tests for the All questions, χ²(1, N = 68) = .24, ns. For the Rule questions, when given first, most of the consistent participants (55%) generated a positive similar test. Again, the proportion of consistent participants using negative tests in the Rule condition is significantly less than the proportion using negative tests for the Only questions, χ²(1, N = 63) = 16.28, P < .001. These results, therefore, show the same general pattern as the previous experiments.

Rule Questions When Answered Second

Participants' performance on the Rule questions again depends on whether the Rule questions are answered first or after the All or Only questions. The pattern of strategy use for the Rule questions varied across the Rule, All-Rule, and Only-Rule conditions; again, in particular, the pattern differed between the All-Rule and Only-Rule conditions (see Note to Table 4). Unlike Experiment 2, however, answering the All questions first did not significantly increase the use of dissimilar tests relative to the Rule baseline condition, χ²(1, N = 65) = .002, ns. This result could be attributed to the imprecision in distinguishing similar from dissimilar tests (as discussed in the previous section entitled Scoring).
As in Experiment 2, answering the Only questions first significantly increased the use of negative tests relative to the Rule baseline condition, χ²(1, N = 58) = 13.09, P < .001. (There is no imprecision in deciding what counts as a negative test.)

Comparing Experiments 2 and 3

The generation version of the task looks more difficult (although not significantly so) across many measures. When measuring consistency: for questions answered first, the same proportion of participants were consistent in Experiments 2 (70%) and 3 (69%), χ²(1, N = 228) = .07, ns; however, for the Rule questions when answered second, fewer participants were consistent in Experiment 3 (64%) than in Experiment 2 (81%), χ²(1, N = 192) = 7.53, P < .01. When measuring use of the normative strategy: for the All questions, consistent participants in Experiment 3 were less likely to use the normative dissimilar test strategy than those in Experiment 2 [47% vs. 54%; χ²(1, N = 71) = .35, ns]; similarly, for the Only questions, consistent participants in Experiment 3 were less likely to use the normative negative test strategy than those in Experiment 2 [53% vs. 68%; χ²(1, N = 66) = 1.46, ns]. For the Rule questions (summing across when they were answered first and second), consistent participants in the generation version were also less likely to use a negative test strategy than those in the evaluation version [21% vs. 30%; χ²(1, N = 202) = 2.29, P = .13, ns].

Note that although the generation task seems more difficult than the evaluation task, the increase in difficulty is comparable across question-type. In particular, in both Experiments 2 and 3, the consistent participants more often used negative tests for the Only questions than for the Rule questions, but the size of the advantage was not different across the two experiments. That is, given the question (i.e. Only or Rule), consistent participants did not differ across experiments in the proportion of times they used negative tests, χ²(2, N = 129) = 2.12, ns.⁶

CONCLUSIONS

Our three experiments demonstrate that many people will use a negative test strategy in the hypothesis-limitation (i.e. Only) task. That strategy is used both when evaluating and generating evidence. In fact, people use that strategy as frequently as they use a diversity strategy in the premise diversity (i.e. All) task. Of course, not all participants use the normative negative test strategy; however, the proportion who do is far greater than expected given the previous findings regarding the failure to use negative tests in the standard Wason task. In contrast to our other findings, in our Rule task participants did not often use negative tests, although the use of such tests is clearly one optimal strategy.
It seems that sometimes participants (correctly) interpret the Rule task as requiring both (diverse) positive and negative tests: one piece of evidence for that is the variability of test use in Experiment 1; a second is the effectiveness of priming test strategy in Experiment 2. Participants had even more difficulty coming up with negative tests when they had to generate such tests in Experiment 3. Both of these factors (the necessity for both kinds of tests and the difficulty of generation) may contribute to the lack of use of negative tests in the original Wason task. Of course, unlike the Wason task, our tasks involve only a single trial; thus, we cannot attempt to explain the lack of negative tests in subsequent trials after the first. However, it should also be recalled that the original Wason task is a rule discovery task, that is, it entails a dual generation process of producing both the hypothesised rule and the test triads, whereas in

⁶ The statistical test is for conditional independence with two degrees of freedom. See Wickens (1989).
