CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems

Similar documents
CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems

Cognitive Modeling. Empirical Research Methods. Ute Schmid. Kognitive Systeme, Angewandte Informatik, Universität Bamberg

Statistics is the science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making effective decisions

9 research designs likely for PSYC 2100

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Still important ideas

Psychology Research Process

Readings: Textbook readings: OpenStax - Chapters 1 11 Online readings: Appendix D, E & F Plous Chapters 10, 11, 12 and 14

Statistics is the science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making effective decisions

Evaluation: Controlled Experiments. Title Text

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Still important ideas

One-Way ANOVAs t-test two statistically significant Type I error alpha null hypothesis dependant variable Independent variable three levels;

Designing Psychology Experiments: Data Analysis and Presentation

Readings: Textbook readings: OpenStax - Chapters 1 13 (emphasis on Chapter 12) Online readings: Appendix D, E & F

Business Statistics Probability

Quantitative Methods in Computing Education Research (A brief overview tips and techniques)

Introduction to statistics Dr Alvin Vista, ACER Bangkok, 14-18, Sept. 2015

Doing Quantitative Research 26E02900, 6 ECTS Lecture 6: Structural Equations Modeling. Olli-Pekka Kauppila Daria Kautto

Collecting & Making Sense of

Readings: Textbook readings: OpenStax - Chapters 1 4 Online readings: Appendix D, E & F Online readings: Plous - Chapters 1, 5, 6, 13

Research Landscape. Qualitative = Constructivist approach. Quantitative = Positivist/post-positivist approach Mixed methods = Pragmatist approach

Sample Exam Questions Psychology 3201 Exam 1

Study Guide for the Final Exam

UNIVERSITY OF THE FREE STATE DEPARTMENT OF COMPUTER SCIENCE AND INFORMATICS CSIS6813 MODULE TEST 2

PRINCIPLES OF STATISTICS

The following are questions that students had difficulty with on the first three exams.

CHAPTER 3 DATA ANALYSIS: DESCRIBING DATA

Prepared by: Assoc. Prof. Dr Bahaman Abu Samah Department of Professional Development and Continuing Education Faculty of Educational Studies

Types of questions. You need to know. Short question. Short question. Measurement Scale: Ordinal Scale

Understanding and serving users

Empirical Research Methods for Human-Computer Interaction. I. Scott MacKenzie Steven J. Castellucci

Measuring the User Experience

Political Science 15, Winter 2014 Final Review

Results & Statistics: Description and Correlation. I. Scales of Measurement A Review

Evaluation: Scientific Studies. Title Text

POLS 5377 Scope & Method of Political Science. Correlation within SPSS. Key Questions: How to compute and interpret the following measures in SPSS

Samples, Sample Size And Sample Error. Research Methodology. How Big Is Big? Estimating Sample Size. Variables. Variables 2/25/2018

Designing Psychology Experiments: Data Analysis and Presentation

Who? What? What do you want to know? What scope of the product will you evaluate?

investigate. educate. inform.

CHAPTER VI RESEARCH METHODOLOGY

Basic Statistical Concepts, Research Design, & Notation

Experimental Research in HCI. Alma Leora Culén University of Oslo, Department of Informatics, Design

Psychology Research Process

STATISTICS AND RESEARCH DESIGN

Psychology 205, Revelle, Fall 2014 Research Methods in Psychology Mid-Term. Name:

Empirical Knowledge: based on observations. Answer questions why, whom, how, and when.

CHAPTER NINE DATA ANALYSIS / EVALUATING QUALITY (VALIDITY) OF BETWEEN GROUP EXPERIMENTS

EXPERIMENTAL RESEARCH DESIGNS

Analysis A step in the research process that involves describing and then making inferences based on a set of data.

Collecting & Making Sense of

Ch. 11 Measurement. Measurement

HPS301 Exam Notes- Contents

SPRING GROVE AREA SCHOOL DISTRICT. Course Description. Instructional Strategies, Learning Practices, Activities, and Experiences.

Designing Experiments... Or how many times and ways can I screw that up?!?

HOW STATISTICS IMPACT PHARMACY PRACTICE?

Chapter 11. Experimental Design: One-Way Independent Samples Design

AS Psychology Curriculum Plan & Scheme of work

Introduction to Statistical Data Analysis I

Examining differences between two sets of scores

9.63 Laboratory in Cognitive Science

Ch. 11 Measurement. Paul I-Hai Lin, Professor A Core Course for M.S. Technology Purdue University Fort Wayne Campus

Validity and Reliability. PDF Created with deskpdf PDF Writer - Trial ::

Statistical analysis DIANA SAPLACAN 2017 * SLIDES ADAPTED BASED ON LECTURE NOTES BY ALMA LEORA CULEN

CHAPTER 4 THE QUESTIONNAIRE DESIGN /SOLUTION DESIGN. This chapter contains explanations that become a basic knowledge to create a good

PTHP 7101 Research 1 Chapter Assignments

Modeling Visual Search Time for Soft Keyboards. Lecture #14

Chapter 7. Marketing Experimental Research. Business Research Methods Verónica Rosendo Ríos Enrique Pérez del Campo Marketing Research

Chapter 1: Introduction to Statistics

Controlled Experiments

The Logic of Data Analysis Using Statistical Techniques M. E. Swisher, 2016

CHAPTER 3 RESEARCH METHODOLOGY

In this chapter we discuss validity issues for quantitative research and for qualitative research.

VARIABLES AND MEASUREMENT

BIOL 458 BIOMETRY Lab 7 Multi-Factor ANOVA

Variables and Scale Measurement (Revisited) Lulu Eva Rakhmilla, dr., M.KM Epidemiology and Biostatistics 2013

Human-Computer Interaction IS4300. I6 Swing Layout Managers due now

Measures. David Black, Ph.D. Pediatric and Developmental. Introduction to the Principles and Practice of Clinical Research

Topic #6. Quasi-experimental designs are research studies in which participants are selected for different conditions from pre-existing groups.

Review and Wrap-up! ESP 178 Applied Research Methods Calvin Thigpen 3/14/17 Adapted from presentation by Prof. Susan Handy

copyright D. McCann, 2006) PSYCHOLOGY is

Unit 1 Exploring and Understanding Data

Figure: Presentation slides:

Lecturer: Rob van der Willigen 11/9/08

Chapter 1: Introduction to Statistics

32.5. percent of U.S. manufacturers experiencing unfair currency manipulation in the trade practices of other countries.

Lecturer: Rob van der Willigen 11/9/08

Formulating Research Questions and Designing Studies. Research Series Session I January 4, 2017

Introduction to Research Methods

LAB ASSIGNMENT 4 INFERENCES FOR NUMERICAL DATA. Comparison of Cancer Survival*

Classical Psychophysical Methods (cont.)

Standard Deviation and Standard Error Tutorial. This is significantly important. Get your AP Equations and Formulas sheet

Basic Steps in Planning Research. Dr. P.J. Brink and Dr. M.J. Wood

Human Computer Interaction - An Introduction

SECTION A. You are advised to spend at least 5 minutes reading the information provided.

Chapter 5: Research Language. Published Examples of Research Concepts


Transcription:

CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems Human Computer Interaction Ute Schmid Applied Computer Science, Bamberg University last change May 8, 2007 CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 1

Why and When Empirical Methods? User-centred approach to design: design based on real data not on imagination abour users what kind of activities needed support User studies at the beginning: collect data, analyse, abstract, build system Support of system evaluation: user test in controlled lab conditions or real surrounding CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 2

Empirical Research Methods Research Methods: soft skills How to generate a operational hypothesis e.g., x 1 < x 2 Number of errors in test group (with help menu) is significantly smaller than number of errors in control group (without help menu) starting point: a verbally described hypothesis (derived from some theory of human information processing) Take care about scale niveau of data, select appropriate test Present results of inference statistics in combination with descriptive charts Knowledge about statistics necessary but: focus on the general procedure of empirical research see: Jürgen Bortz, Lehrbuch der empirischen Sozialforschung CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 3

Kinds of Empirical Studies Case-study vs. random sample Look at a small number of people in detail (typically qualitative data from log-files, other protocolls) Use a representative sample of users (typically in experimental settingsi Single shot vs. longitudinal general characteristics vs. learnability questions special problems of longitudinal studies Regression to the mean In natural setting or in laboratory question of external validity Quasi-experimental or experimental CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 4

What is an Experiment? Wilhelm Wundt (1886): Das Experiment besteht in einer Beobachtung, die sich mit der willkürlichen Einwirkung des Beobachers auf die Entstehung und den Verlauf der zu beobachtenden Erscheinung verbindet. (1908) additionally: Wiederholbarkeit and Variierbarkeit Willkürlichkeit: assign a treatment (independent variable) to a probe/subject Wiederholbarkeit: experiment can be performed at another time (another place, with other probes/subjects) leading to the same results (dependent variable) Variierbarkeit: treatment can be assigned deliberatly and systematically CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 5

Example Word superiority effect (Wheeler, 1970) Treatment: Subjects are randomly assigned to one of the following conditions: (a) short presentation of a letter (e.g. d ), (b) presentation of a word (e.g. word ) Afterwards: recognition (a) either d or k, (b) either word or work Dependent variable: recognition errors Result, significantly less errors in condition (b) CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 6

Quasi-Experiment Some characteristics are invariably correlated with the subject (no random assignment of treatment) Examples: gender, intelligence quotient, color of a person Example: Patients in single bed rooms vs. multi-bed hospital rooms Dependent variable: days until discharge Problem: unrecognized factors correlated with the dependet variable (single rooms are more sunny, doctors take more time with patients in single rooms etc.) CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 7

Experiment and Causal Explanation If a treatment can be assigned randomly to subjects, we can assume that all other factors which might influence the outcome of the experiment are distributed randomly over the subjects In experiments, we can test hypotheses of effects (differences between different treatments, typically including a control group without treatment) Hypotheses about correlations do not allow causal explanations, they only allow statements about the kind and intensity of the co-variation of variables! CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 8

Internal/External Validity Internal validity: no (not many) alternative explanations for results (effect can be ascribed to treatment) External validity: Generalizability of results from the experimental setting to a larger domain (necessary: representativeness of sample) CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 9

Types of Statistics Descriptive Statistics: Means and standard deviations of data (kind of measure depends on scale niveau, e.g. median vs. arithmetic middle) Inference Statistics: Generalize from sample to population with a certain amount of error (e.g. analysis of variance, chi-square,...) Data Aggregation: cluster analysis, multidimensional scaling, principal component analysis CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 10

Presentation of Results Example: RT ms 1750 1500 1250 1000 not neg neg ANOVA F(1, 31) = 17.09,p <.01 CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 11

Testing of Hypotheses about Differences general case: analyses of variance (ANOVA) special case: two independent treatments, t-test Null-Hypothesis (no effect of treatment) vs. alternative hypothesis (effect of treatment) Inference statistics: tests whether the result can occur if it is assumed that the null-hypothesis holds in the population (from which we tested a sample), tests whether differences in the mean of the dependent variable are significant Significance: with an error (typically 5 or 1 percent) we can conclude that the null hypothesis does not hold CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 12

Accept/Reject Hypothesis H 1 : x 1 > x 2 H 0 : x 1 x 2 = 0 CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 13

Two Kinds of Errors In population H 0 H 1 In sample H 0 correct dec. β error H 1 α error correct dec. CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 14

t-test for Independent Samples H 0 : the means of two populations which are normally distributed (Gauss) and with homogenuous variances are not different (µ 1 = µ 2 ) Specific (directed) alternative hypothesis: µ 1 > µ 2 If samples of size n are drawn from a normally-distributed population, the sample means are t-distributed (with n 1 degrees of freedom because n 1 means determine the last mean) t = x 1 x 2 (µ 1 µ 2 ) ˆσ x1 x 2 CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 15

t-test cont. Because null hypothesis µ 1 µ 2 = 0 t = x 1 x 2 ˆσ x1 x 2 Because population variance is unknown, it is estimated from the sample ˆσ x1 x 2 = n1 i=1 (x i1 x 1 ) 2 + n 2 i=1 (x i2 x 1 ) 2 (n 1 1) + (n 2 1) ( 1 n 1 + 1 n 2 ) CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 16

Analysis of Variance More general than t-test For different designs one factor with arbitrary number of groups e.g. factor with 4 groups source of noise: sea and wind, street, construction work, airport multi-factorial with repeated measurement F-distribution: H 0 : no difference in the means, i.e. variance of means is zero H 0 : µ 0 = µ 1 = µ p H 0 : σ µ = 0 H 1 : σ µ = c Degrees of freedom (df) for one factor: p 1 CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 17

Multifactorial Analysis of Variance 5 4 3 2 1 male female E.g., two-factorial: factor A with p steps, factor B with q steps Interaction: A B Example (quasi-experimental): X Dependent variable: attitude to a new law for parental leave (rating 5 to +5) Factor A Political orientation (3 groups), Factor B Gender (2 groups) Interaction: gender influence differs for the different political groups 0 1 G1 G2 G3 X 2 X 3 4 5 CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 18

Chi-Square Tests For categorial data e.g. two independent variables H 0 : An attribute with k values is independent of an attribute with l values Example: Frequency of depression and schizophrenia is distributed differently in different social groups CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 19

Which Kind of Statistical Test Scale niveau of data Distribution of measurements Sample Size CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 20

Scale Niveau Nominal Ordinal: rank order Intervall: determined up to shift of zero and scaling αx + β Rational: determined zero Absolute CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 21

Experimental Ethics Trade off between scientific progress and human dignity (e.g., threshold of pain, learned helplessness,...) Obligation to inform subjects Free decision to participate/quit participation Anonymity of results Experiments in social psychology (e.g. Milgram) vs. in cognitive psychology CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 22

Report of Empirical Studies Highly standardized General structure Introduction Theoretical Background Experiment/Study General Discussion CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 23

Introduction Section name: Introduction Give a short motivation for your study What is the general empirical question? Relevance of the question Need for the study/experiment Advanced organizer for rest of paper CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 24

Theoretical Background Section-name: specific for your study E.g.: Color Perception and Interpretation of Diagrams Describe state of research related to your study Argue why a (further) study is needed to give more precise information Formulate the hypothesis in a general way and explain how you derived this hypothesis CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 25

Experiment Section name: Experiment or Empirical Study Subsections Design (operational hypothesis and resulting design: independent and dependent variables) Method Participants Material Procedure Results Discussion CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 26

Participants The study was conducted in May 2004 with students from Bamberg university. 42 students participated in the study (23 male, 19 female, average age 21.3, sd = 2.4). 2 subjects were excluded from the analysis because of missing values. CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 27

Material Four charts were drawn with XX (see fig. 1 to 4). All charts depicted the same imaginary technical device, called Megafux. Two charts represented temperatur distribution and two charts distribution of frequencies (factor: content type). For each content type, factor representation type was varied such that one chart color coded the distributed value directly on the device and the other chart gave values in a line graph. For each chart, the subjects had to answer one question The temperature/frequence is higher on the upper part than on the lower part of the Megaflux: yes/no. The information was presented in a webbrowser via http. The dependent variables were obtained via keystrokes using php. CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 28

Procedure Subjects were tested individually. One experminental session took about five minutes time. First subjects read a webpage giving general instructions. Afterwards, the question was presented. The subject clicked the OK button if he/she was sure that he/she understood the question. Then a chart appeared. The subjects now clicked either yes or no and the answer together with the time from presentation of the chart until mouse click were stored. CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 29

Results Give descriptive statistics, such as: overall number of correct answers in percent, overall mean and standard deviation of times. Present results for hypothesis in a bar or line graph and give the results of the statistical test If you have controlled some possible extraneous variables, such as position of yes/no button, (gender, age), report their influence CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 30

Discussion Only now relate the findings to your hypothesis Is your hypothesis confirmed by the results? Are the alternative explanations for your result? If your hypothesis could be not confirmed, have you hypotheses about additional factors which have influenced the results? a more specific hypothesis? CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 31

General Discussion Give a short summary of your hypothesis and the main result What are the conclusions of the results (e.g. implication for information presentation) Which follow-up studies/experiments should be conducted? CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 32

User Study Methods Remember: user-centered approach to design Different data collection methods Behavioral data: solution times, errors (high quality) Subjective data: extracted from from interviews, observations, questionnaires CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 33

Interviews Can/should be conducted very early in design (no prototype available) unstructured (if you have really no idea) vs. structured Points to be covered Explain the purpose of the interview Enumerate activities to be supported e.g., form general questions to more specific questions Explore work methods (hardest part, may be difficult to understand for an outsider) Tracing interconnections uncover performance issues CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 34

Analysis of Interview Data Recording of interviews get transcribed Extract categories for the different aspects of users activities There are some software tools for interview analysis http://www.qualitative-research.net If necessary: try to make analysis more objective by using different raters which assign text segments to categories and evaluate interrater reliability CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 35

Observations Best used to observe task performance of users Best not interrupt users activities (non-reactive approach) Data collection via Video Logfiles Thinking aloud CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 36

Data from Observations As in interviews: qualitative data Be aware of Hawthorne effect: workers productivity might increase simply as response to being studied Most important is a careful design of tasks to be studied (representative, varying from routine to special,...) Usability criteria should be translated into dependent variables! CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 37

Questionnairs Best used to assess additional aspects of system use which cannot be derived more directly E.g.: Subjective impressions as satisfaction, feeling in control Questionaire design issues Items should be unambiguous Answer modalities: yes/no, multiple choice, rating scales, open answers Test theoretical questions: reliability, validity For tests which are developped for a broader application! CogSysIII Lecture 4/5: Empirical Evaluation of Software-Systems p. 38