Sawtooth Software. A Parameter Recovery Experiment for Two Methods of MaxDiff with Many Items RESEARCH PAPER SERIES

Similar documents
Sawtooth Software. MaxDiff Analysis: Simple Counting, Individual-Level Logit, and HB RESEARCH PAPER SERIES. Bryan Orme, Sawtooth Software, Inc.

Following is a list of topics in this paper:

Audio: In this lecture we are going to address psychology as a science. Slide #2

Functionalist theories of content

Reliability and Validity checks S-005

Assignment 4: True or Quasi-Experiment

Carrying out an Empirical Project

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES

Who? What? What do you want to know? What scope of the product will you evaluate?

Propensity Score Methods for Estimating Causality in the Absence of Random Assignment: Applications for Child Care Policy Research

Attribute Importance Research in Travel and Tourism: Are We Following Accepted Guidelines?

Investigations of Cellular Automata Dynamics

Learning Styles Questionnaire

Crossing boundaries between disciplines: A perspective on Basil Bernstein's legacy

PEER REVIEW HISTORY ARTICLE DETAILS VERSION 1 - REVIEW. Ball State University

Title: A new statistical test for trends: establishing the properties of a test for repeated binomial observations on a set of items

Economics 2010a. Fall Lecture 11. Edward L. Glaeser

IMPROVING MENTAL HEALTH: A DATA DRIVEN APPROACH

How to Help Your Patients Overcome Anxiety with Mindfulness

PRIMHD summary report user guide - HoNOSCA Health of the Nation Outcome Scales Child and Adolescent

Student Name: XXXXXXX XXXX. Professor Name: XXXXX XXXXXX. University/College: XXXXXXXXXXXXXXXXXX

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data

Huber, Gregory A. and John S. Lapinski "The "Race Card" Revisited: Assessing Racial Priming in

Child Outcomes Research Consortium. Recommendations for using outcome measures

THE IMPACT OF THE NEUTRAL POINT IN THE PAIRED COMPARISONS OF STATEMENTS MODEL (Dedicated to the memory of Prof. Roberto Corradetti)

The PRIMHD Summary Report is organised into three major sections that provide information about:

ON GENERALISING: SEEING THE GENERAL IN THE PARTICULAR AND THE PARTICULAR IN THE GENERAL IN MARKETING RESEARCH

A Comparison of Collaborative Filtering Methods for Medication Reconciliation

To link to this article:

Choosing Life: Empowerment, Action, Results! CLEAR Menu Sessions. Substance Use Risk 2: What Are My External Drug and Alcohol Triggers?

PAIRED COMPARISONS OF STATEMENTS: EXPLORING THE IMPACT OF DESIGN ELEMENTS ON RESULTS

Module 14: Missing Data Concepts

TTI Personal Talent Skills Inventory Coaching Report

Making comparisons. Previous sessions looked at how to describe a single group of subjects However, we are often interested in comparing two groups

Definitions Disease Illness The relationship between disease and illness Disease usually causes Illness Disease without Illness

CHAPTER 2. MEASURING AND DESCRIBING VARIABLES

EVERYTHING DiSC COMPARISON REPORT

August 29, Introduction and Overview

MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and Lord Equating Methods

Probability and Statistics Chapter 1 Notes

An Empirical Study on Causal Relationships between Perceived Enjoyment and Perceived Ease of Use

How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis?

Dilemmatics The Study of Research Choices and Dilemmas

Piaget's theory of cognitive development: schemas, assimilation, accommodation, equilibration, stages of intellectual development. Characteristics of

What is Science 2009 What is science?

Regression CHAPTER SIXTEEN NOTE TO INSTRUCTORS OUTLINE OF RESOURCES

What are Indexes and Scales

White Supremacy Culture perfectionism antidotes sense of urgency

Are the likely benefits worth the potential harms and costs? From McMaster EBCP Workshop/Duke University Medical Center

Introduction. Preliminary POV. Additional Needfinding Results. CS Behavioral Change Studio ASSIGNMENT 2 POVS and Experience Prototypes

Color naming and color matching: A reply to Kuehni and Hardin


Using Effect Size in NSSE Survey Reporting

Measuring the User Experience

Exploiting Similarity to Optimize Recommendations from User Feedback

Context of Best Subset Regression

Sample size calculation a quick guide. Ronán Conroy

Chapter 4: Defining and Measuring Variables

UNLOCKING VALUE WITH DATA SCIENCE BAYES APPROACH: MAKING DATA WORK HARDER

Towards Reflexive Practice

Clever Hans the horse could do simple math and spell out the answers to simple questions. He wasn't always correct, but he was most of the time.

Title: A robustness study of parametric and non-parametric tests in Model-Based Multifactor Dimensionality Reduction for epistasis detection

To provide you with necessary knowledge and skills to accurately perform 3 HIV rapid tests and to determine HIV status.

Imputation of Categorical Missing Data: A comparison of Multivariate Normal and Multinomial Methods. Holmes Finch.

7.NPA.3 - Analyze the relationship of nutrition, fitness, and healthy weight management to the prevention of diseases such as diabetes, obesity,

AUDIOLOGY MARKETING AUTOMATION

Ursuline College Accelerated Program

Answer to Anonymous Referee #1: Summary comments

Neighbourhood Connections report on the 2016 External Partner Survey

3 CONCEPTUAL FOUNDATIONS OF STATISTICS

Chapter 1 Introduction to Educational Research

The reality.1. Project IT89, Ravens Advanced Progressive Matrices Correlation: r = -.52, N = 76, 99% normal bivariate confidence ellipse

How Many Options? Behavioral Responses to Two Versus Five Alternatives Per Choice. Martin Meissner Harmen Oppewal and Joel Huber

Measurement and meaningfulness in Decision Modeling

Selection and Combination of Markers for Prediction

PERSPECTIVES

Progress Monitoring Handouts 1

The Logic of Data Analysis Using Statistical Techniques M. E. Swisher, 2016

Recovering Families: A Tool For Parents in Recovery August 1, 2016

The Clinical Utility of the Relational Security Explorer. Verity Chester, Research and Projects Associate, St Johns House, Norfolk

Can We Assess Formative Measurement using Item Weights? A Monte Carlo Simulation Analysis

Increasing Social Awareness in Adults with Autism Spectrum Disorders

Homework Answers. 1.3 Data Collection and Experimental Design

Recommender Systems Evaluations: Offline, Online, Time and A/A Test

Learn 10 essential principles for research in all scientific and social science disciplines

FUNCTIONAL CONSISTENCY IN THE FACE OF TOPOGRAPHICAL CHANGE IN ARTICULATED THOUGHTS Kennon Kashima

NUTRITION. Step 1: Self-Assessment Introduction and Directions

Evaluating Quality in Creative Systems. Graeme Ritchie University of Aberdeen

CHAPTER THIRTEEN. Data Analysis and Interpretation: Part II.Tests of Statistical Significance and the Analysis Story CHAPTER OUTLINE

Financial Well-Being Scale

Lesson 1: Making and Continuing Change: A Personal Investment

1. What is the I-T approach, and why would you want to use it? a) Estimated expected relative K-L distance.

Plato's Ambisonic Garden

SMS USA PHASE ONE SMS USA BULLETIN BOARD FOCUS GROUP: MODERATOR S GUIDE

DRUG USE OF FRIENDS: A COMPARISON OF RESERVATION AND NON-RESERVATION INDIAN YOUTH

Summary Report: The Effectiveness of Online Ads: A Field Experiment

Exercise in New Zealand - Part 1 What consumers think about exercise and the opportunities for the exercise industry

EPSE 594: Meta-Analysis: Quantitative Research Synthesis

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups

Online Assessment Instructions


Sawtooth Software
RESEARCH PAPER SERIES

A Parameter Recovery Experiment for Two Methods of MaxDiff with Many Items

Keith Chrzan, Sawtooth Software, Inc.

Copyright 2015, Sawtooth Software, Inc.
1457 E 840 N, Orem, Utah
+1 801 477 4700
www.sawtoothsoftware.com

A Parameter Recovery Experiment for Two Methods of MaxDiff with Many Items

Keith Chrzan, Sawtooth Software
January, 2015

Background

Clients don't seem to be able to get enough of a good thing, and this seems to apply more to MaxDiff than to some of the other methods we use: clients frequently ask for MaxDiff experiments that include more items than would allow us to expose each item to each respondent the recommended three or four times. In a paper at a recent Sawtooth Software Conference, Wirth and Wolfrath (2012) introduced two methods for handling large numbers of items in MaxDiff studies.

Express MaxDiff creates different subsets of the large number of items and then asks a given respondent MaxDiff questions that include just that subset of items, with each respondent seeing each of the reduced number of items the recommended three to four times and with different respondents seeing different subsets of items. Express MaxDiff relies upon the magic of HB analysis to fill in the blanks and supply utilities for the items missing from a given respondent's experiment (essentially imputing them based on the population means and covariances). With Sparse MaxDiff, on the other hand, each respondent sees all the items in the study, but fewer than the recommended number of times (perhaps just once).

In an empirical study Wirth and Wolfrath compared the ability of Express and Sparse MaxDiff to predict a few holdout questions: they found that both methods predicted best choices about equally well, but that Sparse did a better job of predicting worst choices. Wirth and Wolfrath (W&W) also conducted a parameter recovery experiment for Express MaxDiff, but not for Sparse MaxDiff. An experiment comparing the ability of Express and Sparse MaxDiff to recover known utilities would help us understand which of the two works better. The following sections detail two such experiments, using data sets whose owners allowed sharing of the data.

Study Methodology

The first study had 84 items (and 729 respondents); the second had 90 items (and 200 respondents). Using the HB utilities from these studies as the known "true" utilities grounds the artificial data experiment as realistically as possible, in that it retains the statistical properties (means and covariances among items) one would see in living human respondents. The MNL model that underpins statistical analysis of MaxDiff experiments has strong assumptions that we can use to apply a theoretically appropriate pattern of random response error (one that conforms to the Gumbel distribution). Moreover, we can add the right amount of response error by making sure our artificial respondents have about the same level of test-retest reliability we see in human respondents. The final step in generating the artificial responses is to design the Express and Sparse questions and then have the artificial respondents make their choices.

The first study featured W&W's original method for making the Express MaxDiff design, while the second used SSI Web's constructed list capabilities to generate a random subset of items for each respondent's Express MaxDiff design. In the first study artificial respondents chose the highest and lowest (utility + positively-skewed Gumbel error) alternatives in each of the MaxDiff questions. The second study applied the errors even more carefully, using positively-skewed Gumbel error for the best choices and negatively-skewed Gumbel error for the worst choices. Doing all this involved some fancy Perl scripting and the data generator capability in SSI Web. The final step, running MaxDiff HB analysis on both resulting data sets, lets us see how well Sparse and Express MaxDiff recreate the "true" utilities after those utilities have been perturbed by random error and run through the two experimental designs.
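As a concrete illustration, here is a minimal Python sketch of the choice simulation described above (the actual work used Perl scripts and SSI Web's data generator; the function and variable names are illustrative, and the error scale is a stand-in for the calibration to test-retest reliability):

```python
# Minimal sketch: simulate one artificial respondent's best/worst choices
# by adding Gumbel error to known "true" utilities (names are illustrative).
import numpy as np

rng = np.random.default_rng(0)

def simulate_best_worst(true_utilities, shown_items, scale=1.0):
    """Simulate one MaxDiff task: perturb each shown item's utility with
    Gumbel error, then pick the highest total as 'best' and the lowest as
    'worst'. A larger `scale` yields noisier (less reliable) respondents;
    in the studies above the scale was tuned to match human test-retest
    reliability."""
    u = np.array([true_utilities[i] for i in shown_items])
    # Positively-skewed Gumbel error for the best choice...
    best_total = u + rng.gumbel(loc=0.0, scale=scale, size=len(u))
    # ...and negatively-skewed (negated Gumbel) error for the worst choice,
    # as in the second study.
    worst_total = u - rng.gumbel(loc=0.0, scale=scale, size=len(u))
    best = shown_items[int(np.argmax(best_total))]
    worst = shown_items[int(np.argmin(worst_total))]
    return best, worst

# Example: five items from a 90-item study shown in one Sparse MaxDiff task.
true_u = {i: rng.normal() for i in range(90)}  # stand-in "true" utilities
print(simulate_best_worst(true_u, [3, 17, 42, 56, 88]))
```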

Results

At the aggregate level (looking at mean HB utilities) we can compute the correlation of the mean utilities estimated by each of Express and Sparse MaxDiff with the mean true utilities (from human respondents), and we can test the difference between these correlations using a t-test for differences between dependent correlations (Cohen and Cohen 1983). In both studies the correlations between known and estimated aggregate utilities show that the two methods perform very similarly: 0.993 for Sparse and 0.985 for Express in the first study, and 0.996 for Sparse and 0.992 for Express in the second. While trivially small and of no practical value, these differences are statistically significant (p<0.001). If you only need mean utilities, either method works great.

Often, however, aggregate utilities will not allow us to meet a study's objectives, and we need to rely upon the quality of respondent-level utilities (e.g., for simulations, TURF, or segmentation). To assess the quality of respondent-level utilities we can calculate the correlation of Express and Sparse HB utilities with the known utilities for each respondent; with paired correlations we can run a dependent t-test for means, as we could with any other paired variables. At the respondent level, the average correlations of Sparse/Express utilities with the true utilities were 0.752/0.684 in the first study and 0.855/0.802 in the second. In this case the differences between Sparse and Express in both studies are large enough to be meaningful, as well as highly significant (p<0.001). For studies requiring respondent-level utilities, Sparse appears to be a better way to go than Express (though both approaches sacrifice individual-level precision compared to typical MaxDiff studies, wherein each respondent sees each item multiple times).
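The respondent-level comparison is easy to reproduce in code. The sketch below, with placeholder data rather than the studies' actual utility matrices, computes a Pearson correlation per respondent and then runs the dependent (paired) t-test; the aggregate-level test for dependent correlations (Cohen and Cohen 1983) would require a separate formula not shown here:

```python
# Sketch: per-respondent correlations with "true" utilities, then a paired
# t-test comparing Sparse vs. Express (placeholder data, illustrative names).
import numpy as np
from scipy import stats

def per_respondent_correlations(estimated, true):
    """Pearson correlation between estimated and true utilities, computed
    row by row; both arrays are (n_respondents, n_items)."""
    est_c = estimated - estimated.mean(axis=1, keepdims=True)
    tru_c = true - true.mean(axis=1, keepdims=True)
    num = (est_c * tru_c).sum(axis=1)
    den = np.sqrt((est_c**2).sum(axis=1) * (tru_c**2).sum(axis=1))
    return num / den

rng = np.random.default_rng(1)
true_hb = rng.normal(size=(200, 90))                      # placeholder data
sparse_hb = true_hb + rng.normal(scale=0.6, size=true_hb.shape)
express_hb = true_hb + rng.normal(scale=0.8, size=true_hb.shape)

r_sparse = per_respondent_correlations(sparse_hb, true_hb)
r_express = per_respondent_correlations(express_hb, true_hb)
print(f"mean r: Sparse={r_sparse.mean():.3f}, Express={r_express.mean():.3f}")
t, p = stats.ttest_rel(r_sparse, r_express)  # dependent t-test across respondents
print(f"paired t={t:.2f}, p={p:.2g}")
```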

Discussion

In the Sawtooth Software User Group discussion on LinkedIn that followed the overview of this research, and in private replies to the post, several folks made useful observations and suggestions, among them:

First, the RLH statistic that some analysts use as a basis for determining respondent quality (and potentially for excluding respondents from reporting) will be more reliable with Express than with Sparse.

Joel Anderson had a clever suggestion: rather than using artificial respondents, one could take the data from actual respondents and discard a random two-thirds of their choices, effectively giving them something between a Sparse and an Express design (a sketch of this subsampling appears at the end of this section). Joel found that this method reproduced the utilities from the full data set very well. A redo of this experiment with very controlled selection of the choice sets to discard might well allow a rigorous test of Express and Sparse MaxDiff using data from human respondents, something well worth sharing at one of our conferences (hint, hint).

Tom Eagle pointed out that measurement theory might suggest that Express would outperform Sparse: for a given item, Express MaxDiff gets really good information on 30 out of (say) 90 items and imputes the values of the other 60, while Sparse MaxDiff may get poor measurements of all 90 items.

Several people noted that artificial respondents are not real humans. Of course this is true: no matter how lovingly and realistically we craft artificial respondents, and even if we add theory-driven amounts and distributions of random error to their responses, we simply cannot guarantee that what we find with artificial respondents will generalize to human respondents. For example, real humans may benefit from seeing items multiple times: having seen an item before, they may spend less cognitive effort on subsequent viewings (as opposed to seeing each item just once, fresh and new). Hence the value of experiments along the lines Joel Anderson recommends.

If you like this discussion, you may also want to check out the paper "Bandit Adaptive MaxDiff Designs for Huge Number of Items" from the 2015 Sawtooth Software Conference, held in March 2015 in Orlando, Florida (Fairchild et al. 2015). That paper focuses on finding the top items in a long list (one of their examples includes 300 items!) rather than providing quality utilities for all the items in a study at the individual respondent level. Interestingly enough, their simulation results also suggest a slight edge in performance for Sparse over Express MaxDiff.
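Joel Anderson's subsampling procedure is straightforward to prototype. Here is a hypothetical sketch (the one-third retention rate and all names are illustrative, not taken from his actual analysis):

```python
# Sketch of Joel Anderson's suggestion: keep real respondents but discard a
# random two-thirds of each one's choice tasks, then re-estimate utilities
# and compare against those from the full data set (illustrative names).
import random

def subsample_tasks(tasks, keep_fraction=1/3, seed=42):
    """Return a random subset of one respondent's MaxDiff tasks."""
    rng = random.Random(seed)
    n_keep = max(1, round(len(tasks) * keep_fraction))
    return rng.sample(tasks, n_keep)

# e.g., a respondent who answered 15 tasks keeps 5: roughly Sparse-like exposure
tasks = [f"task_{i}" for i in range(15)]
print(subsample_tasks(tasks))
```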

References

Cohen, J. and P. Cohen (1983), Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum.

Fairchild, Kenneth, Bryan Orme, and Eric Schwartz (2015), "Bandit Adaptive MaxDiff Designs for Huge Number of Items," 2015 Sawtooth Software Conference Proceedings (forthcoming).

Wirth, R. and A. Wolfrath (2012), "Using MaxDiff for Evaluating Very Large Sets of Items," 2012 Sawtooth Software Conference Proceedings, pp. 59-78.