Empirical Formula for Creating Error Bars for the Method of Paired Comparison

Similar documents
Louis Leon Thurstone in Monte Carlo: Creating Error Bars for the Method of Paired Comparison

Detection Theory: Sensitivity and Response Bias

This is a repository copy of Monte Carlo analysis of incomplete paired-comparison experiments.

Categorical Perception

Lecturer: Rob van der Willigen 11/9/08

Lecturer: Rob van der Willigen 11/9/08

Detection Theory: Sensory and Decision Processes

Absolute Identification is Surprisingly Faster with More Closely Spaced Stimuli

Absolute Identification Is Relative: A Reply to Brown, Marley, and Lacouture (2007)

Modeling Human Perception

A BAYESIAN SOLUTION FOR THE LAW OF CATEGORICAL JUDGMENT WITH CATEGORY BOUNDARY VARIABILITY AND EXAMINATION OF ROBUSTNESS TO MODEL VIOLATIONS

AN ALTERNATIVE METHOD FOR ASSESSING LIKING: POSITIONAL RELATIVE RATING VERSUS THE 9-POINT HEDONIC SCALE ABSTRACT

STUDY OF THE BISECTION OPERATION

An Ideal Observer Model of Visual Short-Term Memory Predicts Human Capacity Precision Tradeoffs

Laboratory exercise: Adaptive tracks and psychometric functions

The Color Between Two Others

COMPARISON OF DIFFERENT SCALING METHODS FOR EVALUATING FACTORS IMPACT STUDENTS ACADEMIC GROWTH

CHAPTER 3 METHOD AND PROCEDURE

Framework for Comparative Research on Relational Information Displays

The Effect of Guessing on Item Reliability

Investigating the robustness of the nonparametric Levene test with more than two groups

ISO 20462, A psychophysical image quality measurement standard

Mantel-Haenszel Procedures for Detecting Differential Item Functioning

Technical Specifications

The Use of Item Statistics in the Calibration of an Item Bank

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note

General recognition theory of categorization: A MATLAB toolbox

TESTING A NEW THEORY OF PSYCHOPHYSICAL SCALING: TEMPORAL LOUDNESS INTEGRATION

ELEMENTS OF PSYCHOPHYSICS Sections VII and XVI. Gustav Theodor Fechner (1860/1912)

Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz

Effects of Sequential Context on Judgments and Decisions in the Prisoner s Dilemma Game

Information Processing During Transient Responses in the Crayfish Visual System

Roots of Experimental Psychology: Psychophysics and Memory. Psychophysics: First Empirical Investigations of the Mind

The SAGE Encyclopedia of Educational Research, Measurement, and Evaluation Multivariate Analysis of Variance

Sage Publications, Inc. and American Sociological Association are collaborating with JSTOR to digitize, preserve and extend access to Sociometry.

STATISTICS AND RESEARCH DESIGN

Classical Psychophysical Methods (cont.)

Weighting the Seriousness of Perceived Health Problems Using Thurstone's Method of Paired Comparisons

Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination

A contrast paradox in stereopsis, motion detection and vernier acuity

Jackknife-based method for measuring LRP onset latency differences

A Memory Model for Decision Processes in Pigeons

Intransitivity on Paired-Comparisons Instruments:

AQCHANALYTICAL TUTORIAL ARTICLE. Classification in Karyometry HISTOPATHOLOGY. Performance Testing and Prediction Error

Limitations of Additive Conjoint Scaling Procedures: Detecting Nonadditivity When Additivity Is Known to Be Violated

Psy201 Module 3 Study and Assignment Guide. Using Excel to Calculate Descriptive and Inferential Statistics

Spreadsheet signal detection

GENERALIZATION GRADIENTS AS INDICANTS OF LEARNING AND RETENTION OF A RECOGNITION TASK 1

computation and interpretation of indices of reliability. In

DO STIMULUS FREQUENCY EFFECTS OCCUR WITH LINE SCALES? 1. Gert Haubensak Justus Liebig University of Giessen

PSYC 441 Cognitive Psychology II

Perceptual separability and spatial models'

Morton-Style Factorial Coding of Color in Primary Visual Cortex

Evaluating normality of procedure up down trasformed response by Kolmogorov Smirnov test applied to difference thresholds for seat vibration

Comparison of the Null Distributions of

Issues Surrounding the Normalization and Standardisation of Skin Conductance Responses (SCRs).

Birds' Judgments of Number and Quantity

Estimating Reliability in Primary Research. Michael T. Brannick. University of South Florida

A SAS Macro to Investigate Statistical Power in Meta-analysis Jin Liu, Fan Pan University of South Carolina Columbia

ASSESSING THE UNIDIMENSIONALITY, RELIABILITY, VALIDITY AND FITNESS OF INFLUENTIAL FACTORS OF 8 TH GRADES STUDENT S MATHEMATICS ACHIEVEMENT IN MALAYSIA

Free classification: Element-level and subgroup-level similarity

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES

Measuring the Impossible : Quantifying the Subjective (A method for developing a Risk Perception Scale applied to driving)

Fundamentals of Psychophysics

MEA DISCUSSION PAPERS

On testing dependency for data in multidimensional contingency tables

Pushing the Right Buttons: Design Characteristics of Touch Screen Buttons

Using the Distractor Categories of Multiple-Choice Items to Improve IRT Linking

Citation for published version (APA): Ebbes, P. (2004). Latent instrumental variables: a new approach to solve for endogeneity s.n.

METHODOLOGICAL COMMENTARY. Payne and Jones Revisited: Estimating the Abnormality of Test Score Differences Using a Modified Paired Samples t Test*

Using SAS to Calculate Tests of Cliff s Delta. Kristine Y. Hogarty and Jeffrey D. Kromrey

Steven A. Cholewiak ALL RIGHTS RESERVED

Statistical Audit. Summary. Conceptual and. framework. MICHAELA SAISANA and ANDREA SALTELLI European Commission Joint Research Centre (Ispra, Italy)

Chapter 02. Basic Research Methodology

Likelihood ratio, optimal decision rules, and relationship between proportion correct and d in the dual-pair AB-versus-BA identification paradigm

Learning to classify integral-dimension stimuli

Three methods for measuring perception. Magnitude estimation. Steven s power law. P = k S n

Psychology of Perception Psychology 4165, Fall 2001 Laboratory 1 Weight Discrimination

MODELLING CHARACTER LEGIBILITY

Section 6: Analysing Relationships Between Variables

Examining the efficacy of the Theory of Planned Behavior (TPB) to understand pre-service teachers intention to use technology*

The implications of processing event sequences for theories of analogical reasoning

Monte Carlo Analysis of Univariate Statistical Outlier Techniques Mark W. Lukens

Child Neuropsychology, in press. On the optimal size for normative samples in neuropsychology: Capturing the

Running head: INDIVIDUAL DIFFERENCES 1. Why to treat subjects as fixed effects. James S. Adelman. University of Warwick.

Imperfect, Unlimited-Capacity, Parallel Search Yields Large Set-Size Effects. John Palmer and Jennifer McLean. University of Washington.

Behavioral Data Mining. Lecture 4 Measurement

TRAVEL TIME AS THE BASIS OF COGNITIVE DISTANCE*

Ranking: Perceptions of Tied Ranks and Equal Intervals on a Modified Visual Analog Scale

SAMENESS AND REDUNDANCY IN TARGET DETECTION AND TARGET COMPARISON. Boaz M. Ben-David and Daniel Algom Tel-Aviv University

Issues That Should Not Be Overlooked in the Dominance Versus Ideal Point Controversy

Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University

MBA 605 Business Analytics Don Conant, PhD. GETTING TO THE STANDARD NORMAL DISTRIBUTION

A STATISTICAL PATTERN RECOGNITION PARADIGM FOR VIBRATION-BASED STRUCTURAL HEALTH MONITORING

SUPPLEMENTARY INFORMATION. Table 1 Patient characteristics Preoperative. language testing

CHAPTER 3 RESEARCH METHODOLOGY

RELIABILITY RANKING AND RATING SCALES OF MYER AND BRIGGS TYPE INDICATOR (MBTI) Farida Agus Setiawati

One-Way ANOVAs t-test two statistically significant Type I error alpha null hypothesis dependant variable Independent variable three levels;

Psychometric Scaling: Avoiding the Pitfalls and Hazards

INTRODUCTION. Institute of Technology, Cambridge, MA

Transcription:

Empirical Formula for Creating Error Bars for the Method of Paired Comparison Ethan D. Montag Rochester Institute of Technology Munsell Color Science Laboratory Chester F. Carlson Center for Imaging Science 54 Lomb Memorial Drive Rochester, NY 14623 E-mail: montag@cis.rit.edu Abstract. The method of paired comparison based on Thurstone s Case V of his Law of Comparative Judgments is often used as a psychophysical method to derive interval scales of perceptual qualities in imaging applications. However, methods for determining confidence intervals and critical distances for significant differences have been elusive leading some to abandon the simple analysis provided by Thurstone s formulation. Monte Carlo simulations of paired comparison experiments were performed in order to derive an empirical formula for determining error. The results show that the variation in the distribution of experimental results can be well predicted as a function of stimulus number and the number of observations. Using these results, confidence intervals and critical values for comparisons can be made using traditional statistical methods. Key words: psychophysics, image quality, paired comparison, confidence intervals 1 Introduction There are many psychophysical techniques for creating scales of perception and sensation. Common techniques, such as the classical psychophysical threshold methods or ratio scaling methods can be used to create scales of sensory magnitude. 1-3 However, for judgments of image quality or image preference, these techniques may not be applicable. For example, if one wants to determine which of several printers produces prints that are of highest image quality, there is no one continuous physical parameter that is being manipulated. In addition, the scale that is to be created is not a ratio scale because there exists no concept of zero quality. What is desired is a scale that assigns numbers to the psychological percept that allows comparisons of the different stimuli. Such techniques for creating interval scales include different ranking, sorting, and comparison procedures and statistical analyses that transform the subjects ratings into interval scales. The method of paired comparison has become a popular tool for evaluating the effect of various algorithms or treatments on image quality or for quantifying the change in a perceptual characteristic (Engeldrum s -nesses 4 ) such as perceived contrast, sharpness, graininess, etc. Specifically, reference is made to Thurstone s Law of Comparative Judgment. 5-7 In its original formulation, 6 Thurstone presents a simple theory of the discriminal process and how its nature allows the construction of an interval scale based on comparisons of pairs of stimuli. This has been the starting point for much research in the application and analysis of paired comparison

data. However, in its original formulation and simplifying assumptions, the analysis of paired comparison data is computationally easy and straightforward. This said, we have had some problems in implementing Thurstone s Law because the ability to compute confidence intervals is missing in the formulation. Later research has used modifications of Thurstone s Law to produce error metrics, 8 but this seems to be at a cost of losing the inherent simplicity of Thurstone s original formulation. In this paper we use Monte Carlo simulations to estimate an empirical formula for constructing error bars based on Thurstone s Law. 2 Thurstone s Law of Comparative Judgment Thurstone presents the Law of Comparative Judgment 6 based on the following propositions: 1. Each stimulus gives rise to a discriminal process, which has some value on the psychological continuum of interest. 2. Due to momentary fluctuations (occurring within or between observers), the value of a stimulus may vary on repeated presentations. The distribution of this fluctuation can be characterized by a postulated normal distribution. 3. The mean and standard deviation of the distribution associated with a stimulus are its internal scale values and discriminal dispersion, respectively. 4. Therefore the distribution of the difference between two stimuli is also normally distributed and it is a function of the proportion that one stimulus is chosen as greater than the other. 5. The difference in scale values, R, between two stimuli, i and j, is: R i " R j = z ij # i 2 + # j 2 " 2r ij # i # j (1) where R i and R j represent the scale values of stimuli i and j, σ i and σ j are the standard deviations of the respective discriminal dispersions, r ij is the correlation between the two discriminal processes, and z ij is the normal deviate (the z-score) corresponding to the proportion of times stimulus j is judged is judged greater along the psychological continuum than stimulus i. For Thurstone s Case V, we assume that: 1) The evaluation of one stimulus along the continuum does not influence the evaluation of the other in the paired comparison (r ij = 0) and 2) The dispersions are equal for all stimuli (σ i = σ j ). These assumptions lead to this formulation of the Law: R i " R j = z ij # 2 (2) Procedures for testing these assumptions (and the normality of the discriminal distribution) have been provided but for our purposes in paired comparison along continua such as image quality or preference we accept these assumptions. 2

The analysis of data, according to this equation, is as follows: With n stimuli, n(n-1)/2 stimulus pairs are presented N times for judgment where N is the product of the number of presentations of each pair by each observer by the total number of observers. A frequency matrix (where the i th column entry is chosen over the j th row entry) of these judgments is constructed and then converted into proportions by diving by N. In turn the proportion are converted into normal deviates and the average of the columns create the interval scale values for the stimuli. There seems to be no simple solution in the literature for determining the confidence intervals based on the Case V solution. Bock and Jones 8 present an equation for calculating confidence intervals based on the arcsine transformation rather than the normal deviate. David 9 presents other methods for determining the statistical significance of scale value differences but these too rely on different analyses of the data. In the past, we have used Eq. (2) to calculate confidence 95% confidence intervals were CI = R±1.96(0.707/ N) where the term (0.707/ N) reflects an estimate of the standard error but we have never quite been satisfied with this because we surmised that the number of stimuli must also play a role in determining the error. Fortunately, the confidence intervals that result from this equation are quite conservative so that our previous experimental findings are not put into jeopardy. 3 Monte Carlo Simulation To determine how scale values can vary and then empirically determine the standard deviation of these values for the construction of confidence intervals and tests of significance a Monte Carlo simulation of the paired comparison experiment was performed. By assuming an underlying psychological continuum that conforms to Thurstone s Case V and randomly sampling from normal distributions along this continuum we can recreate the distribution of results that are expected over many repetitions of the experiment. 3.1 The Simulation For n stimuli, we chose n values as the means of normal distributions all with equal standard deviations. Samples were chosen from each of the n(n-1)/2 pairs of distributions and compared and the results were tallied. This was done N times for the N observations in the experiment. This would constitute one experiment in which the scale values were calculated as described above. This experiment was then repeated many times so that the means and standard deviations of the scale values could be calculated. The mean of the standard deviations of each of the scale values was then calculated to determine the overall standard deviation expected from an experiment with n stimuli and N observations. So, for example, for n = 6 stimuli, 6 normal distributions with means located at [5 6 7 8 9 10] and a standard deviation of 5 were used for sampling of the stimulus pairs. (A large standard deviation along the psychological continuum was chosen to reduce the number of results that contained unanimous judgments.) This was repeated for, say, N = 30 observations. For each pair of n and N, the experiment was repeated 10,000 times and the means and standard deviations of the 6 scale values were calculated. The mean of the 6 standard deviations was taken as the overall observed standard deviation seen in an experiment. 3

The number of stimuli used in the experiment was n = [4 5 6 7 8 10 12 15]. The number of observation was N = [10 20 30 40 50 60]. Each experiment was repeated 10,000 times except when n = 15 which was repeated 5,000 times because of limitations in computer memory. These simulations were repeated twice with the standard deviation of the underlying psychological distribution at a value of 5 and 6. The locations of the means of the distributions along the psychological continuum and their standard deviations did not have an effect on the results, as would be expected from the theory. Figure 1 shows the results of the simulation. The observed standard deviation is a function of both the number of stimuli, n, and the number of observations, N. 3.2 Estimates of the observed standard deviation Abandoning an analytic approach, a number of equations were fit to the data to find an empirical equation that could capture the observed standard deviation as a function of the number of stimuli and observations. The following equation gave a good fit: " obs = b 1 ( n # b 2 ) b 3 ( N # b 4 ) b 5 (3) with b 1 = 1.76, b 2 = -3.08, b 3 = -0.613, b 4 = 2.55, and b 5 = -0.491. The solid lines in Figure 1 show the fit of this equation to the data. The percent rms of this fit was 0.87. The fit is very good, which is not remarkable given that there are 5 parameters. It does seem to overestimate the standard deviation at N = 10. This is not a bad feature considering that with such a small number of observers there is much less power in the experiment and the chance of unanimous judgments occurring by chance is increased. Repetition of the simulation with various values of n, N, and mean values on the psychological continuum verify that this equation gives an excellent approximation of the observed variability. Insert Figure 1 about here 3.3 Implementation To calculate 95% confidence intervals (CI) around the interval scale values determined in any particular experiment, one can use the expression CI = ±1.96σ obs. In designs of paired comparison experiments that reduce experimental labor 7 the expression can also be used to adjust the size of the CIs for the various experimental partitions. Besides its use for determining confidence intervals, the value of σ obs can be useful in determining more stringent critical values and confidence intervals for multiple comparisons of scale values. It is recognized that multiple comparisons of pairs of means in the analysis of variance leads to an increase in errors. It must also be true for the comparison of multiple scale values in paired comparison experiments. Keppel 10 enumerates a number of planned and post hoc tests such as the Scheffé, Dunnett, and 4

Tukey tests for controlling error rate for multiple comparisons. Using an empirical value of the standard deviation in the place of the standard error in these tests may prove useful. 3.4 Goodness of fit In the course of doing the simulations, Mosteller s χ 2 test of goodness of fit 11 was performed on each of the individual experiments. This test compares the arcsine transform of the observed proportions to those predicted by the resulting scale values. The proportion of the experiments that were rejected based on this test for the different combinations of n and N were calculated. Surprisingly, these proportions were never less than 0.05 even thought the underlying distributions from which the observations were drawn were normally distributed, conforming to Case V assumptions, and unidimensional. This would lend one to think that this test is conservative, although Engledrum 2 cites Mosteller s report 11 that the model often appears better than it really is. 4 Conclusions In this paper Monte Carlo simulations were used to estimate an empirical formula for the standard deviation of scale values and the constructing confidence intervals based on Thurston s Law. It is recognized that a thorough understanding of the underpinnings of statistical distributions, experimental noise, and mathematical precision would lead to a much more satisfactory answer, but for now we are more interested in creating a tool for the analysis and determination of the significance of our data. References 1. Bartleson, C. James, and Franc Grum, eds. Optical Radiation Measurements, Vol. 5, Visual Measurements. New York: Academic Press, Inc, 1984. 2. P. G. Engeldrum, Psychometric Scaling, Imcotek Press, Winchester, MA (2000). 3. G. A. Gescheider, Psychophysics: The Fundamentals, Lawrence Erlbaum Assoc., Publishers, Mahwah, NJ (1997). 4. P. G. Engeldrum, "A framework for image quality models," Journal of Imaging Science and Technology 39(4), 312-318 (1995). 5. L. L. Thurstone, "Psychophysical analysis," American Journal of Psychology 38, 368-389 (1927). 6. L. L. Thurstone, "A law of comparative judgment," Psychological Review 34, 273-286 (1927). 7. W. S. Torgeson, Theory and Methods of Scaling, John Wiley & Sons, New York (1958). 8. R. D. Bock, and L. V. Jones, The Measurement and Prediction of Judgment and Choice, Holden-Day, San Francisco (1968). 9. H. A. David, The Method of Paired Comparison, Hafner Press, New York (1969). 10. G. Keppel, Design and AnalysisA Researcher's Handbook, Prentice-Hall, Inc, Englewood Cliffs, NJ (1982). 11. F. Mosteller, "Remarks on the method of paired comparison: III. A test of significance for paired comparisons when equal standard deviations and equal correlations are assumed," Psychometrika 16(2), 207-218 (1951). 5

Figure captions: Fig. 1 Results of the simulations of paired comparison experiments showing the observed average standard deviation of the scale values as a function of the number of observations (N) and stimuli (n) with fit of Eq. 3.

Observed Standard Deviation 0.2 0.18 0.16 0.14 0.12 0.1 0.08 0.06 n = 15, st.dev. = 5 n = 12, st.dev. = 5 n = 10, st.dev. = 5 n = 8, st.dev. = 5 n = 7, st.dev. = 5 n = 6, st.dev. = 5 n = 5, st.dev. = 5 n = 4, st.dev. = 5 n = 15, st.dev. = 6 n = 12, st.dev. = 6 n = 10, st.dev. = 6 n = 8, st.dev. = 6 n = 7, st.dev. = 6 n = 6, st.dev. = 6 n = 5, st.dev. = 6 n = 4, st.dev. = 6 0.04 0 10 20 30 40 50 60 70 Number of Observations (N)