o^ &&cvi AL Perceptual and Motor Skills, 1965, 20, Southern Universities Press 1965

Similar documents
Goodness of Pattern and Pattern Uncertainty 1

Technical Specifications

Incorporating quantitative information into a linear ordering" GEORGE R. POTTS Dartmouth College, Hanover, New Hampshire 03755

Effects of Sequential Context on Judgments and Decisions in the Prisoner s Dilemma Game

A NOTE ON THE CORRELATIONS BETWEEN TWO MAZES

SUPPLEMENTAL MATERIAL

Discrimination Weighting on a Multiple Choice Exam

SELECTIVE ATTENTION AND CONFIDENCE CALIBRATION

Comparing Direct and Indirect Measures of Just Rewards: What Have We Learned?

1 Diagnostic Test Evaluation

ELEMENTS OF PSYCHOPHYSICS Sections VII and XVI. Gustav Theodor Fechner (1860/1912)

Lecturer: Rob van der Willigen 11/9/08

Lecturer: Rob van der Willigen 11/9/08

COMPUTING READER AGREEMENT FOR THE GRE

Chapter 11. Experimental Design: One-Way Independent Samples Design

Examining differences between two sets of scores

' ( j _ A COMPARISON OF GAME THEORY AND LEARNING THEORY HERBERT A. SIMON!. _ CARNEGIE INSTITUTE OF TECHNOLOGY

Chapter 12. The One- Sample

..." , \ I \ ... / ~\ \ : \\ ~ ... I CD ~ \ CD> 00\ CD) CD 0. Relative frequencies of numerical responses in ratio estimation1

Are Retrievals from Long-Term Memory Interruptible?

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES

MODELS FOR THE ADJUSTMENT OF RATING SCALES 1

Misleading Postevent Information and the Memory Impairment Hypothesis: Comment on Belli and Reply to Tversky and Tuchin

Color naming and color matching: A reply to Kuehni and Hardin

Spectrograms (revisited)

Behavioural Processes

Judging the Relatedness of Variables: The Psychophysics of Covariation Detection

ui lort-term MEMORY FOR COMPLEX MEANINGFUL VISUAL CONFIGURATIONS: A DEMONSTRATION OF CAPACITY 1

Underestimating Correlation from Scatterplots

Pigeons and the Magical Number Seven

The influence of irrelevant information on speeded classification tasks*

Exploring the reference point in prospect theory

Color name as a function of wavelength and instruction '

Title: A new statistical test for trends: establishing the properties of a test for repeated binomial observations on a set of items

Interference with spatial working memory: An eye movement is more than a shift of attention

A model of parallel time estimation

Section 3.2 Least-Squares Regression

Research Methods 1 Handouts, Graham Hole,COGS - version 1.0, September 2000: Page 1:

Measuring and Assessing Study Quality

Identification of More Risks Can Lead to Increased Over-Optimism of and Over-Confidence in Software Development Effort Estimates

BINOCULAR DEPTH PERCEPTION IN SMALL-ANGLE

Recollection Can Be Weak and Familiarity Can Be Strong

Discrimination and Generalization in Pattern Categorization: A Case for Elemental Associative Learning

ExperimentalPhysiology

Project exam in Cognitive Psychology PSY1002. Autumn Course responsible: Kjellrun Englund


GENERALIZATION GRADIENTS AS INDICANTS OF LEARNING AND RETENTION OF A RECOGNITION TASK 1

Introduction to Behavioral Economics Like the subject matter of behavioral economics, this course is divided into two parts:

THE importance of the choice situation is

ADMS Sampling Technique and Survey Studies

Congruency Effects with Dynamic Auditory Stimuli: Design Implications

Examining the Psychometric Properties of The McQuaig Occupational Test

Designed Experiments have developed their own terminology. The individuals in an experiment are often called subjects.

Appendix III Individual-level analysis

Effects of Adult Age on Structural and Operational Capacities in Working Memory

Measuring Psychological Wealth: Your Well-Being Balance Sheet

INTRODUCTION. Institute of Technology, Cambridge, MA

computation and interpretation of indices of reliability. In

Using the Patient Reported Outcome Measures Tool (PROMT)

On the Difference Between Strength-Based and Frequency-Based Mirror Effects in Recognition Memory

Exam 2 PS 306, Fall 2005

CLINICAL DEMENTIA RATING SUMMARY

Placebo and Belief Effects: Optimal Design for Randomized Trials

MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and. Lord Equating Methods 1,2

Pooling Subjective Confidence Intervals

Persistence in the WFC3 IR Detector: Intrinsic Variability

concerns in a non-clinical sample

CHAPTER 3 RESEARCH METHODOLOGY

Mortality in relation to alcohol consumption: a prospective study among male British doctors

Magnitude and accuracy differences between judgements of remembering and forgetting

Research Ethics and Philosophies

HEREDITARY NATURE OF "HYPOTHESES"

PROFILE SIMILARITY IN BIOEQUIVALENCE TRIALS

Perceptual separability and spatial models'

Choice set options affect the valuation of risky prospects

Scale Invariance and Primacy and Recency Effects in an Absolute Identification Task

WARNING, DISTRACTION, AND RESISTANCE TO INFLUENCE 1

Measures of sensitivity based on a single hit rate and false alarm rate: The accuracy, precision, and robustness of d, A z, and A

Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination

USE OF A DELAYED SIGNAL TO STOP A VISUAL REACTION-TIME RESPONSE *

Study Guide for the Final Exam

The reality.1. Project IT89, Ravens Advanced Progressive Matrices Correlation: r = -.52, N = 76, 99% normal bivariate confidence ellipse

Review. Imagine the following table being obtained as a random. Decision Test Diseased Not Diseased Positive TP FP Negative FN TN

The reliability and validity of the Index of Complexity, Outcome and Need for determining treatment need in Dutch orthodontic practice

Sensation is the conscious experience associated with an environmental stimulus. It is the acquisition of raw information by the body s sense organs

Fundamentals of Psychophysics

1 The conceptual underpinnings of statistical power

Calibration in tennis: The role of feedback and expertise

A PCT Primer. Fred Nickols 6/30/2011

Things you need to know about the Normal Distribution. How to use your statistical calculator to calculate The mean The SD of a set of data points.

Further Properties of the Priority Rule

How financial incentives and cognitive abilities. affect task performance in laboratory settings: an illustration

Is subjective shortening in human memory unique to time representations?

Autonomic Response Specificity and Rorschach Color Responses

SUPPLEMENTARY INFORMATION. Table 1 Patient characteristics Preoperative. language testing

Basic Statistics and Data Analysis in Work psychology: Statistical Examples

Viewpoints on Setting Clinical Trial Futility Criteria

Chapter 6 Topic 6B Test Bias and Other Controversies. The Question of Test Bias

How Far Away Is That? It Depends on You: Perception Accounts for the Abilities of Others

Natural Scene Statistics and Perception. W.S. Geisler

Transcription:

Ml 3 Hi o^ &&cvi AL 44755 Perceptual and Motor Skills, 1965, 20, 311-316. Southern Universities Press 1965 m CONFIDENCE RATINGS AND LEVEL OF PERFORMANCE ON A JUDGMENTAL TASK 1 RAYMOND S. NICKERSON AND CHARLES C. MCGOLDRICK, JR. Decision Sciences Laboratory Electronic Systems Division Bedford, Massachusetts Summary. A 4-alternative forced-choice test was administered to 96 Ss. Ss task was to attempt to select the correct alternative from each test item and to indicate his degree of confidence in his choice on a 5-point rating scale. The objective was to compare the confidence assignments of Ss who did relatively well on the primary judgmental task with those of Ss who did poorly. It was found that Ss who performed poorly on the primary task (LP Ss) tended on the average to use lower confidence ratings than Ss who did relatively well (HP Ss). Although few used either high or low ratings exclusively, all Ss tended to use one end of the confidence scale much more frequently than the other. However, whereas HP Ss were fairly consistent in using the high end of the scale, LP Ss were about evenly divided between those using the high end and those using the low. For both groups, performance tended to be monotonically related to expressed confidence. In terms of measures developed by Adams and Adams, HP Ss made more "realistic" confidence judgments than did LP Ss; however, there was no striking difference between groups in terms of differences in performance associated with step increases in expressed confidence. A positive monotonic relationship between expressed confidence and objective performance measures has been obtained with a variety of judgmental tasks (e.g., Henman, 1911; Pollack & Decker, 1958; Carterette & Cole, 1959; Nickerson & McGoldrick, 1963). Unfortunately, the same confidence-performance relationship could be obtained from pooled data if either (a) each S used each confidence rating equally often, assigning high ratings relatively more frequently to correct judgments, or (b) the most frequently correct Ss used only high ratings while Ss less frequently correct used only low ones. A cursory inspection of the data of previous experiments makes it obvious that neither of these response patterns is generally the case. Typically, Ss do not use each confidence rating with equal frequency, nor do they restrict themselves exclusively to one or the other end of the scale. What is not obvious is whether Ss who perform relatively well on the primary judgmental task distribute their confidence ratings differently, e.g., more realistically, than do those whose primary task performance is low. METHOD Stimulus material and procedural details have been described in a previous report (Nickerson & McGoldrick, 1963). Only set M of these materials was 'This report is identified as Electronics Systems Division Technical Documentary Report No. ESD-TDR-64-236. Further reproduction is authorized to satisfy the needs of the U. S. Government. The authors wish to express their appreciation to Alan Meskil who prepared the figures. AD6lU6 l

312 R. S. NICKERSON & C. C. MCGOLDRICK, JR. used in the present experiment. Briefly, 5s were given a 100-item 4-alternative forced-choice test, each item consisting of the names of 4 states. 5's task was to identify the largest state (area) in each item and to indicate his degree of confidence in his choice on a 5-point scale ranging from "certain I'm correct" to "pure guess." Usual randomization and counterbalancing procedures were followed to minimize effects of patterning or position biases. Seventy-two paid undergraduate college students served as 5s. Their data were pooled with those of the 24 Group M 5s of the previous experiment to yield an N of 96. RESULTS AND DISCUSSION 5s were rank-ordered in terms of their performance on the primary judgmental task. The 24 most frequently, and the 24 least frequently, correct 5s were designated as high performance (HP) and low performance (LP) groups, respectively. The mean percent correct size judgments for the former group was 64 and for the latter, 33. A mean confidence rating was calculated for each 5 by multiplying the numerical value of each rating by its relative frequency of use and summing over ratings. Means of the individual means were 3.74 and 3.05 for HP and LP groups, respectively. The individual means were rank-ordered and a Mann- Whitney U test showed the difference between groups to be significant (p <.02). From this it may be concluded that, on the average, LP 5s tended to use lower confidence ratings than did HP 5s. This may also be seen from the distributions of ratings for the two groups shown in Fig. 1. It may be noted that, although the LP distribution clearly is shifted toward the low end of the scale relative to the HP distribution, LP 5s, as a group, used > A A HP.50 D D LP LlJ 3 O.40 UJ.30 > H < a-s-y.20 hj ^r.10 A-- A FIG. 1. 12 3 4 5 CONFIDENCE RATING Distribution of confidence ratings for high and low performance groups

CONFIDENCE RATINGS AND PERFORMANCE LEVELS 313 the two highest confidence ratings at least as frequently as the two lowest in spite of their near chance performance on the primary task. It must be emphasized, however, that the nearly rectangular distribution shown in Fig. 1 was not characteristic of individuals within the group. In fact for every S of both groups the departure from a rectangular distribution of ratings was significant. [For all 5s but 1, x~ > 14 (> <.01); for the remaining 5, x 2 > 12 (p <.02).] Twenty of the 24 Ss of each group used one end of the confidence scale (two highest or two lowest ratings) at least 3 times as frequently as the opposite end. Whereas most (21) of the HP Ss used the high end of the scale, the LP group was fairly evenly divided between Ss who tended to use one end of the scale and those who tended to use the other; hence the near rectangular distribution of Fig. 1. &-- ~ HP i.oo- A -D L p m.80- o h- <r r^jf. '"M. <.60 J - ->^ H? o.40- yc-rf.20- XfcyC. 12 3 4 5 CONFIDENCE RATING chance line FlG. 2. Ratio of number correct to total number of judgments associated with each confidence rating for each performance group Ratios of correct-to-total (CT) number of judgments associated with each confidence rating were computed and are displayed in Fig. 2. The solid lines represent CT ratios computed from pooled data; the broken lines represent averages of the CT ratios of individual Ss. The two procedures appear to give fairly similar plots in this case; however, they could give very dissimilar ones. With pooled data, an individual S's influence in the determination of a particular CT ratio depends on how frequently (relative to other Ss) he made use of the associated rating. In the alternative case each S is an equally important determinant of each C T ratio irrespective of the frequency with which he used the different ratings (except in cases in which he failed to use a particular rating at all). This procedure has the unfortunate property that a CT

314 R- S. NICKERSON & C. C. MCGOLDRICK, JR. ratio of 1.00 based on, say, 50 responses of one S may be offset by a C,T ratio of 0 based on a single response of another S. As suggested by Fig. 2 and borne out by a test of trend (Hayes, 1957), CT ratio tended to increase with confidence level for both groups (p <.01). However, in neither group were differences in confidence accompanied by proportional differences in performance, except possibly at the upper end of the confidence scale. If the lowest and highest ratings are interpreted as predictions of chance and perfect performance, respectively (which is consistent with S's instructions), and the rating scale is treated as an equal interval scale, then the diagonal line represents the ideal relationship between CT ratio and expressed confidence. That is, points lying on the diagonal represent maximal agreement between performance "predicted" from the confidence assignments and that in fact obtained. (Actually perfect agreement is not to be expected even given an ideal assignment of the confidence ratings since they are restricted to five discrete values whereas CT ratio is not; however, the error due to this fact would be sufficiently slight to be inconsequential in the present context.) Points below the diagonal are suggestive of overconfidence in the sense that the associated confidence ratings predict better performance than was actually obtained. Conversely, points above the diagonal suggest underconfidence since the obtained performance was better than that predicted by the ratings. As measures of realism of confidence expressions Adams and Adams (1961) have developed algebraic and absolute discrepancy scores defined, respectively, as Z(pt-Pt) HI-«I and S pi-pi V W2\f»i, "in which Pi is the percentage correct at confidence p, and «, is the number of decisions made with confidence p,." The algebraic discrepancy score is equivalent to the algebraic difference between mean confidence and the total per cent correct, and gives an indication of general overconfidence or underconfidence. The absolute discrepancy score gives a weighted average absolute difference between per cent correct observed and that predicted by the confidence assignments. Although, in this experiment, confidence ratings were defined in terms of pi only in the sense that the ends of the scale were explicitly associated with chance and certainty, for purposes of analysis the five ratings were replaced with pi ranging in equal steps from 25 (chance) to 100. (Whether comparable results would be obtained if Ss were required to actually express confidence in terms of probabilities or expected percentages is an empirical question and can be determined only by further experimentation.) Discrepancy scores were computed for each S of both groups. All but 5 Ss, each of whom was in the HP group, had positive algebraic scores suggesting

CONFIDENCE RATINGS AND PERFORMANCE LEVELS 315 general overconfidence. The mean algebraic discrepancy scores for HP and LP groups, respectively, were 11.6 and 29.8; mean absolute discrepancy scores for the two groups were 22.4 and 33.5. Mann-Whitney U tests showed the between groups differences to be significant in both cases (p <.01). It appears that, in terms of the measures proposed by Adams and Adams, Ss who performed best on the primary task tended to assign confidence ratings more realistically than did those who performed poorly. Some caution is necessary in interpreting these measures since both may vary strictly as a function of performance on the primary task. As an extreme but convincing illustration, consider the case of two hypothetical Ss, one of whom is considerably better than the other with respect to the primary decision task, but both of whom assign confidence judgments by pulling them out of a hat. Such an experiment might yield relationships between per cent correct and confidence similar to those illustrated in Fig. 3. Line a represents better perform- 1.00-.80- o i- <.60- rr H o.40-.20- ' 12 3 4 CONFIDENCE RATING a ^J> chance line FIG. 3- Plot of hypothetical data illustrating the possibility of differences in disctepancy scores resulting solely from differences in level of performance on the primary task ance on the primary decision task than does line b, but both represent a random assignment of confidence ratings. However, the data represented by line a would yield a considerably smaller algebraic or absolute discrepancy score, than would those represented by line b. It might be argued that even in this case it is not unreasonable to speak of degrees of realism in the confidence assignments since indeed the correspondence between obtained performance and rhat predicted by the ratings, however the latter were arrived at, is unquestionably greater in one case than in the other. However, this would seem to stretch the

316 R- S. NICKERSON & C. C. MC GOLDRICK, JR. connotation of realism somewhat beyond its accepted domain and moreover does not seem to be consistent with Adams and Adams' use of the word. Perhaps a distinction should be made between realism, as defined by Adams and Adams, and what might be called sensitivity; the former denoting the degree of correspondence between predicted and obtained performance measures in terms of absolute values, and the latter reflecting simply the extent to which differences in confidence are indicative of differences in performance on the primary task. With reference to the data represented in Fig. 2, sensitivity is reflected roughly in the slope of a curve; whereas, realism, as measured by Adams and Adams, varies with both slope and intercept. As a crude test for between-groups differences in sensitivity, differences in C'T ratio associated with step increases in confidence were obtained for each S. For each step increase in confidence the CT differences were rank-ordered and a Mann-Whitney U test was made on the ranks. Significant between-groups differences were obtained only in the case of performance increments associated with the confidence increment 4 to 5 (p <.05). It must be remembered, however, that the probability of getting p <.05 by chance in at least one out of 4 different tests is considerably greater than.05; hence, the significance of the difference obtained is questionable. Although we are not justified on the basis of a Mann-Whitney test to conclude that differences do not exist even in those cases in which p <.05 was not obtained, inspection of Fig. 2 suggests that whatever between-group differences there may be are not very large. The plots look quite similar except for differences in intercept. REFERENCES ADAMS, J. K., & ADAMS, P. A. Realism in confidence judgments. Psychol. Rev., 1961, 68, 33-45. CARTERETTE, E. C, & COLE, M. A comparison of the receiver operating characteristics for messages received by ear and by eye. Dept. of Navy, ONR, Report No. 2, June, 1959. HAYES, J. R. M. A non-parametric test of trend. Psychol. Newsltr, 1957, 9, 29-34. HENMAN, V. A. C. The relation of the time of a judgment to its accuracy. Psychol. Rev., 1911, 18, 186-201. NICKERSON, R. S., & McGOLDRICK, C. C. Confidence, correctness, and difficulty with non-psychophysical comparative judgments. Percept, mot. Skills, 1963, 17, 159-167. POLLACK, I., & DECKER, L. R. Confidence ratings, message reception and the receiver operating characteristics. J. Acoust. Soc. Amer., 1958, 30, 286-292. Accepted December 10, 1964.