Evaluation: Controlled Experiments
Outline
- Evaluation beyond usability tests
- Controlled experiments
- Other evaluation methods
Evaluation Beyond Usability Tests
Usability Evaluation (last week)
- Expert tests / walkthroughs
- Usability tests with users
- Main goal is formative: identify usability problems and improve the tool
Summative Evaluation (focus today)
- How good is it? Is it useful? Is it better than other tools?
Formative and Summative: usually combined, with formative and summative evaluation alternating over time
Evaluation goals (summative)
- Generalizability: results can be applied to other people
- Precision: we measured what we wanted to measure (controlling factors we did not intend to study)
- Realism: the study context is realistic
Usually a trade-off between them!
McGrath / Carpendale: The selection of a research method depends on the research question and the object under study!
Controlled Experiments
Controlled experiment, also known as:
- laboratory experiment / lab study
- user study
- A/B testing (used in marketing)
Focus: precision, generalizability (?)
Overall goal: reveal cause-effect relationships (e.g., smoking causes cancer)
Scenario: variant A vs. variant B. Which is better?
Test it with users! (Carpendale)
Hypothesis: a precise problem statement
- Example: H1 = participants will buy more beer with variant B than with variant A
- Null hypothesis: H0 = no difference in beer purchases
Independent Variables: the factors to be studied
Typical independent variables in HCI:
- different types of design
- task type (e.g., searching vs. browsing)
- participant demographics (e.g., male/female)
- different technologies (e.g., touchpad vs. keyboard)
Control of independent variables:
- Levels: the number of conditions per factor, limited by the length of the study and the number of participants
- How different? Entire interfaces vs. very specific parts
Control the Environment
- Make sure nothing else could cause your effect
- Control confounding variables
- Randomize!
Different Designs: Between-Subjects
- Divide the participants into groups; each group does one condition (Group 1: A, Group 2: B)
- Randomize the group assignment
- Potential problem?
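Randomized group assignment can be sketched in a few lines of Python; this is an illustrative sketch, not part of the lecture, and the function name and seed parameter are made up:

```python
import random

def assign_between_subjects(participants, conditions, seed=None):
    """Shuffle participants, then deal them round-robin into one
    group per condition (between-subjects design).
    The optional seed exists only to make examples reproducible."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    groups = {c: [] for c in conditions}
    for i, p in enumerate(shuffled):
        groups[conditions[i % len(conditions)]].append(p)
    return groups

# 12 participants, two conditions -> two disjoint groups of 6
groups = assign_between_subjects(range(12), ["A", "B"], seed=7)
```

Shuffling before dealing avoids systematic bias from, e.g., sign-up order, while the round-robin split keeps the group sizes balanced.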
Different Designs: Within-Subjects
- Everybody does all the conditions
- Accounts for individual differences and reduces noise (which is why it can be more powerful and require fewer participants)
- Severely limits the number of conditions, and even the types of tasks tested (multiple sessions may work around this)
- Can lead to ordering effects, so randomize the order
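Order can also be counterbalanced by cycling through all permutations of the conditions; a minimal sketch (the function name is made up, and it assumes few conditions, since the number of orders grows factorially):

```python
from itertools import permutations

def counterbalanced_orders(conditions, n_participants):
    """Assign each participant one order of the conditions so that
    all possible orders are used (nearly) equally often.
    This counteracts ordering/learning effects in within-subjects designs."""
    orders = list(permutations(conditions))
    return [list(orders[i % len(orders)]) for i in range(n_participants)]

# 4 participants, conditions A and B -> AB, BA, AB, BA
print(counterbalanced_orders(["A", "B"], 4))
```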
Dependent Variables: the things that you measure
- Performance indicators: task completion time, error rates, mouse movement (number of beers bought)
- Subjective participant feedback: satisfaction ratings, closed-ended questions, interviews, questionnaires (HCI lecture last week)
- Observations: behaviors, signs of frustration
Tasks
Specifying good tasks for controlled experiments is tricky, especially if you are measuring performance criteria.
Task criteria:
- comparability across the different interfaces
- a clear end point
Example:
- usability test: "buy a book for a 4-year-old"
- controlled experiment: "find and buy the book Doctor Faustus by Thomas Mann"
Results: Application of Statistics
- Descriptive statistics: describe the data you gathered (e.g., visually)
- Inferential statistics: make predictions/inferences from your study to the larger population
Descriptive statistics
Central tendency:
- mean {1, 2, 4, 5} = 3
- median {15, 19, 22, 29, 33, 45, 50} = 29
- mode {12, 15, 22, 22, 22, 34, 34} = 22
Measures of spread:
- range
- variance: s² = Σ(xᵢ − x̄)² / N
- standard deviation: s = √(s²)
Note: for inferential statistics the divisor N becomes (N − 1), giving an estimate for the sampled population.
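Python's standard statistics module provides all of these measures directly; a short sketch (note the separate population vs. sample functions, matching the N vs. N − 1 distinction):

```python
import statistics as st

data = [1, 2, 4, 5]

print(st.mean(data))                             # 3
print(st.median([15, 19, 22, 29, 33, 45, 50]))   # 29
print(st.mode([12, 15, 22, 22, 22, 34, 34]))     # 22

# Population variance/SD divide by N (purely descriptive):
print(st.pvariance(data))   # 2.5
# Sample variance/SD divide by N - 1 (estimate for the sampled population):
print(st.variance(data))    # 3.333...
print(st.pstdev(data), st.stdev(data))
```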
Visualization of descriptive statistics, e.g., a boxplot showing:
- mean
- 25%/75% quartiles
- min/max (alternative: whiskers with outliers marked separately)
Inferential statistics
Goal: generalize findings to the larger population
http://www.latrobe.edu.au/psy/research/cognitive-and-developmental-psychology/esci
Excursus: the tragedy of the error bars
- CI = confidence interval
- SE = standard error (the SD of the sampling distribution of the sample mean)
- SD = standard deviation
Excursus: 95% confidence intervals. USE THEM!
Interpretation: we can be 95% confident that the true mean lies within our confidence interval!
More intuition about stats, Seeing Theory: http://students.brown.edu/seeing-theory/
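A normal-approximation 95% CI is easy to compute with the standard library alone; a sketch assuming a z critical value of about 1.96 (for small samples a t critical value would be more appropriate; the function name ci95 is made up):

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def ci95(sample):
    """Normal-approximation 95% confidence interval for the mean:
    mean +/- z * SE, with SE = s / sqrt(n) and z ~ 1.96.
    For small n, replace z with a t critical value."""
    n = len(sample)
    m = mean(sample)
    se = stdev(sample) / sqrt(n)      # standard error of the mean
    z = NormalDist().inv_cdf(0.975)   # ~1.96
    return (m - z * se, m + z * se)

# Made-up task-completion times (minutes) for one condition:
lo, hi = ci95([4.1, 3.8, 4.4, 4.0, 3.9, 4.2, 4.3, 3.7])
```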
Null Hypothesis Testing
- Statistically significant result: p < .05
- Choosing .05 as the threshold keeps the probability of incorrectly rejecting the null hypothesis (Type I error) below 5%
- Many different tests: t-test, ANOVA, ...
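The t statistic behind an independent-samples (Welch) t-test can be computed by hand; a sketch with made-up data (in practice a library routine such as scipy.stats.ttest_ind would also return the p-value):

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples:
    t = (mean_a - mean_b) / sqrt(var_a/n_a + var_b/n_b),
    using the sample (N - 1) variance."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

# Beers bought with variant A vs. variant B (made-up numbers):
t = welch_t([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])
print(t)  # -2.0
```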
Validity
Errors:
- Type I: false positive
- Type II: false negative
External validity: can we generalize the study, e.g., to the larger population of undergrad students?
Internal validity: is there a causal relationship? Are there alternative causes?
Internal Validity: Storks deliver babies!?
R. Matthews, "Storks Deliver Babies (p = 0.008)", Teaching Statistics, vol. 22, issue 2, pp. 36-38, 2000.
- There is a correlation coefficient of r = 0.62 (reasonably high)
- A statistical test shows that this correlation is in fact significant (p = 0.008)
- What are the flaws?
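Pearson's r is straightforward to compute, which is part of the trap: a sketch with made-up, perfectly linear data shows a high r without any causal link (the function name is made up; Python 3.10+ also offers statistics.correlation):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Perfect linear relationship -> r close to 1.0 (still no proof of causation!)
print(pearson_r([1, 2, 3, 4], [10, 20, 30, 40]))
```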
Pragmatically: a step-by-step how-to
Experimental Procedure: a typical example
1. Identify the research hypothesis
2. Specify the design of the study
3. Think about statistics *before* you run the study
4. Run a pilot study
5. Recruit participants
6. Run the actual data collection sessions
7. Analyze the data
8. Report the results
Run a pilot study
- to test the study design
- to test the system
- to test the study instruments
Recruit participants
- Reflecting the larger population? In the best case yes, but often a pragmatic decision
- How many? Depends on the effect size and the study design (the power of the experiment); usually 15+ per group
- Note: much higher than for a usability test (~5)
Run the actual data collection sessions
- System and instruments ready?
- Greet the participants
- Introduce the purpose of the study and the procedure (or deliberately don't)
- Don't bias: "compare my interface vs. this other interface"
- Get the participants' consent (ethics!)
- Assign participants to their experiment condition according to a pre-defined randomization method
- Introduce the system(s) and/or run training tasks
- Participants complete the actual tasks; take measures of the dependent variables
- Participants answer the questionnaire (if any)
- Debriefing session
- Payment (if any): money, coupons, chocolate
Report the results
- Introduction / motivation
- Study design
- Results
- Discussion
- Conclusions
- References / appendix
See, for instance, Saul Greenberg's recommendations:
http://pages.cpsc.ucalgary.ca/~saul/hci_topics/assignments/controlled_expt/ass1_reports.html
Other Evaluation Methods
Field Studies
- Focus: realism
- Reveal "a richer understanding by using a more holistic approach" (Carpendale, 2008)
Qualitative Methods
- Observation techniques: fly-on-the-wall, interruptions by the observer
- Interview techniques: contextual?
Qualitative Methods as an Add-on
Often: a controlled experiment plus
- experimenter observations
- collecting participants' opinions
- think-aloud protocol (be careful!)
Helpful for:
- usability improvement (cf. the HCI lectures of the last weeks)
- new insights, explanations of unforeseen results, new questions
- confirming results
Qualitative Methods as Primary
- Pre-design studies: rich understanding of a complex domain (problems, challenges, domain language)
- During- and post-design studies: case studies / field studies
Helpful for a holistic understanding
Qualitative Methods as Primary
- In-situ observations
- Participatory observations
- Laboratory observational studies
- Contextual interviews
- Focus groups
Qualitative Challenges
- Sample sizes: doing intensive studies with many participants? How much time? How much data is produced?
- Subjectivity: what is the social relationship with the participants?
- Analyzing the data: grounded theory, open and axial coding
New Ways of Evaluation
- Mechanical Turk (increasingly popular)
- Measuring brain activity
- ...