Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments

Similar documents
Measurement and Descriptive Statistics. Katie Rommel-Esham Education 604

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison

Interpreting the Item Analysis Score Report Statistical Information

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Still important ideas

VIEW: An Assessment of Problem Solving Style

VARIABLES AND MEASUREMENT

Business Statistics Probability

By Hui Bian Office for Faculty Excellence

On the purpose of testing:

Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD

Introduction to Reliability

Readings: Textbook readings: OpenStax - Chapters 1 11 Online readings: Appendix D, E & F Plous Chapters 10, 11, 12 and 14

Students will understand the definition of mean, median, mode and standard deviation and be able to calculate these functions with given set of

Standard Scores. Richard S. Balkin, Ph.D., LPC-S, NCC

Introduction to statistics Dr Alvin Vista, ACER Bangkok, 14-18, Sept. 2015

Influences of IRT Item Attributes on Angoff Rater Judgments

PRINCIPLES OF STATISTICS

3 CONCEPTUAL FOUNDATIONS OF STATISTICS

Appendix B Statistical Methods

Construct Reliability and Validity Update Report

Results & Statistics: Description and Correlation. I. Scales of Measurement A Review

C-1: Variables which are measured on a continuous scale are described in terms of three key characteristics central tendency, variability, and shape.

Measuring the User Experience

Lesson 9 Presentation and Display of Quantitative Data

Readings: Textbook readings: OpenStax - Chapters 1 13 (emphasis on Chapter 12) Online readings: Appendix D, E & F

Statistics is the science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making effective decisions

CHAPTER ONE CORRELATION

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria

Analysis and Interpretation of Data Part 1

Standard Deviation and Standard Error Tutorial. This is significantly important. Get your AP Equations and Formulas sheet

Psychology 205, Revelle, Fall 2014 Research Methods in Psychology Mid-Term. Name:

Chapter 2 Norms and Basic Statistics for Testing MULTIPLE CHOICE


Statistics: Interpreting Data and Making Predictions. Interpreting Data 1/50

Reliability and Validity checks S-005

Chapter 7: Descriptive Statistics

An Introduction to Research Statistics

15.301/310, Managerial Psychology Prof. Dan Ariely Recitation 8: T test and ANOVA

Ecological Statistics

SLEEP DISTURBANCE ABOUT SLEEP DISTURBANCE INTRODUCTION TO ASSESSMENT OPTIONS. 6/27/2018 PROMIS Sleep Disturbance Page 1

Welcome to OSA Training Statistics Part II

CHAPTER 2. MEASURING AND DESCRIBING VARIABLES

Applied Statistical Analysis EDUC 6050 Week 4

ABOUT PHYSICAL ACTIVITY

LANGUAGE TEST RELIABILITY On defining reliability Sources of unreliability Methods of estimating reliability Standard error of measurement Factors

Quizzes (and relevant lab exercises): 20% Midterm exams (2): 25% each Final exam: 30%

Item Analysis: Classical and Beyond

Inferential Statistics

Biostatistics. Donna Kritz-Silverstein, Ph.D. Professor Department of Family & Preventive Medicine University of California, San Diego

Empirical Knowledge: based on observations. Answer questions why, whom, how, and when.

MEANING AND PURPOSE. ADULT PEDIATRIC PARENT PROXY PROMIS Item Bank v1.0 Meaning and Purpose PROMIS Short Form v1.0 Meaning and Purpose 4a

Validity and reliability of measurements

Development, Standardization and Application of

Using the Rasch Modeling for psychometrics examination of food security and acculturation surveys

PSYCHOLOGICAL STRESS EXPERIENCES

Georgina Salas. Topics EDCI Intro to Research Dr. A.J. Herrera

Chapter 20: Test Administration and Interpretation

The Psychometric Development Process of Recovery Measures and Markers: Classical Test Theory and Item Response Theory

Smoking Social Motivations

Psychometrics for Beginners. Lawrence J. Fabrey, PhD Applied Measurement Professionals

Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories

Introduction to Multilevel Models for Longitudinal and Repeated Measures Data

PÄIVI KARHU THE THEORY OF MEASUREMENT

ANXIETY A brief guide to the PROMIS Anxiety instruments:

Basic Statistics 01. Describing Data. Special Program: Pre-training 1

CHAPTER 3 DATA ANALYSIS: DESCRIBING DATA

Item Analysis for Beginners

PHYSICAL STRESS EXPERIENCES

Unit 1 Exploring and Understanding Data

FATIGUE. A brief guide to the PROMIS Fatigue instruments:

Readings: Textbook readings: OpenStax - Chapters 1 4 Online readings: Appendix D, E & F Online readings: Plous - Chapters 1, 5, 6, 13

Theory. = an explanation using an integrated set of principles that organizes observations and predicts behaviors or events.

Statistical Significance, Effect Size, and Practical Significance Eva Lawrence Guilford College October, 2017

COGNITIVE FUNCTION. PROMIS Pediatric Item Bank v1.0 Cognitive Function PROMIS Pediatric Short Form v1.0 Cognitive Function 7a

Examining the Psychometric Properties of The McQuaig Occupational Test

How Lertap and Iteman Flag Items

Practice-Based Research for the Psychotherapist: Research Instruments & Strategies Robert Elliott University of Strathclyde

Never P alone: The value of estimates and confidence intervals

ABOUT SMOKING NEGATIVE PSYCHOSOCIAL EXPECTANCIES

Designing Psychology Experiments: Data Analysis and Presentation

Observer OPTION 5 Manual

Stats 95. Statistical analysis without compelling presentation is annoying at best and catastrophic at worst. From raw numbers to meaningful pictures

INTRODUCTION TO ASSESSMENT OPTIONS

Chapter 2--Norms and Basic Statistics for Testing

EVALUATING AND IMPROVING MULTIPLE CHOICE QUESTIONS

Collecting & Making Sense of

Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz

Item Response Theory. Steven P. Reise University of California, U.S.A. Unidimensional IRT Models for Dichotomous Item Responses

Do not write your name on this examination all 40 best

HS Exam 1 -- March 9, 2006

Techniques for Explaining Item Response Theory to Stakeholder

A Comparison of Several Goodness-of-Fit Statistics

ISC- GRADE XI HUMANITIES ( ) PSYCHOLOGY. Chapter 2- Methods of Psychology

Transcription:

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments. Greg Pope, Analytics and Psychometrics Manager. 2008 Users Conference, San Antonio.

Introduction and purpose of this session: This session will present some of the theory behind assessment analysis, to put the numbers into context. The discussion will be as non-technical as possible, with a more applied approach. Questionmark tools that can help evaluate the performance of assessments and assessment items will be presented with applied examples. If you have questions, please do ask during the session. Slide 2

Agenda:
- A brief review of the theory
- Putting theory into practice with some Questionmark tools:
  - Score List Report, Coaching Report, Transcript Report
  - Item Analysis Report
  - Test Analysis Report
  - Results Management System (RMS)
- Summary
- Question and answer period
Slide 3

CTT and IRT, what's the diff? CTT (Classical Test Theory) is what we all know and love:
- P-values
- Discrimination statistics (point-biserial correlations, high minus low performance, etc.)
- Been around a long time (most of the 20th century)
- Works very well for most applications; by far the most widely used form of item/test analysis
- Works with smaller sample sizes (e.g., 150-200 or less)
- Relatively simple to compute (no fitting of data to a model)
- Has a different set of assumptions from IRT
Slide 4

CTT and IRT, what's the diff? IRT (Item Response Theory) is an alternative that some of us may have heard of or use:
- a-parameter: item discrimination
- b-parameter: item difficulty
- c-parameter: item pseudo-guessing
- Been around since the 1960s (Lord)
- Makes things like computer adaptive testing (CAT) and advanced test development techniques possible
- More complex to compute (fitting data to a model)
- Requires larger sample sizes depending on the number of parameters: the more parameters, the more participant responses needed (e.g., 700+ for 3-PL)
- More information: http://edres.org/irt/
Slide 5
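
To make the three parameters concrete, here is a minimal sketch of a 3PL item characteristic curve in Python (my own illustration, not from the talk; the item values a = 1.2, b = 0.0, c = 0.20 are assumed):

```python
import math

def icc_3pl(theta, a, b, c):
    """3PL item characteristic curve:
    P(correct | theta) = c + (1 - c) / (1 + exp(-a * (theta - b)))."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# A moderately discriminating (a), average-difficulty (b) item with a
# pseudo-guessing floor (c) of 0.20, e.g. a four-option multiple choice item.
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"theta = {theta:+.1f}  P(correct) = {icc_3pl(theta, 1.2, 0.0, 0.20):.3f}")
```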

CTT and IRT: CTT item analysis versus IRT item analysis (the item characteristic curve, ICC). Slide 6

CTT and IRT: Which is better, CTT or IRT? Each is used for its own purposes, and each has pros and cons. Questionmark currently uses CTT in its products:
- Flexible in terms of sample sizes
- Fast to compute, with few or no computational gotchas
- People are familiar with these statistics and so do not need to learn a new measurement model
- CTT meets the needs of 99% or more of customers
- CTT statistics are related to IRT statistics to some degree: p-values are highly correlated with b-values, and point-biserial correlations are highly correlated with a-values
We will be discussing CTT today. Slide 7

Reliability: Reliability is used in everyday language: "my car runs reliably" means it starts every time. We are going to be talking about test score reliability; essentially, how consistently the test scores measure a construct. We can't go into all the detail here today; for a good primer on the theory see: Traub, R.E. (1994). Reliability for the Social Sciences: Theory & Applications. Thousand Oaks: Sage. Slide 8

Reliability (briefly, the theory): An assessment is a measurement instrument composed of many individual measurements (questions/items). What is being measured is the ability, trait, construct, or latent variable of interest (a massage therapy certification exam may measure massage knowledge/skills; an investment banking test may measure the construct "knowledge of investment banking"). All measurement instruments have error in their estimates, so the traditional view of test score reliability says that a person's: observed score = theoretical true score + error. Slide 9
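
A tiny simulation of this model (my own illustration, with assumed true-score and error spreads): under CTT, reliability can be read as the proportion of observed-score variance that is true-score variance.

```python
import random
import statistics

random.seed(1)
true_scores = [random.gauss(70, 10) for _ in range(5000)]  # theoretical true scores
errors = [random.gauss(0, 5) for _ in range(5000)]         # random measurement error
observed = [t + e for t, e in zip(true_scores, errors)]    # observed = true + error

# Reliability as the share of observed-score variance that is true-score variance.
reliability = statistics.variance(true_scores) / statistics.variance(observed)
print(f"simulated reliability = {reliability:.2f}")  # near 100 / (100 + 25) = 0.80
```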

Measurements and error: Measurements made by a thermometer of temperature are imperfect (atmospheric variables, sunlight, etc.). To mitigate this, take lots of measurements using different, high quality thermometers. [Slide shows eight thermometer readings: 78.2, 78.9, 77.5, 78.1, 78.7, 78.0, 77.9, 78.4] Slide 10

Measurements and error: Measurements made by a test question of a construct are imperfect (participant fatigue, psychological variables, etc.). To mitigate this, take lots of measurements using different, high quality questions. [Slide shows nine questions, Q1 through Q9] Slide 11

Reliability: Four approaches for measuring reliability:
1. Internal consistency: Correlations of the items comprising the test (how well they "hang together")
2. Split-half (split-forms): Correlation of two forms (splits) of the test (e.g., the first 25 items versus the last 25)
3. Test-retest: Correlation between multiple administrations of the same test
4. Inter-rater reliability: Correlation between two or more raters (markers) who rate the same thing (e.g., provide essay scores)
Slide 12
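
A minimal sketch of approach 2 (illustrative code; the function names and the odd/even split are my assumptions; the slide's example splits the first 25 items versus the last 25). The Spearman-Brown step corrects the half-test correlation up to full test length.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def split_half_reliability(responses):
    """responses: one list of 0/1 item scores per participant.
    Splits into odd/even item halves, correlates the two half scores,
    then applies the Spearman-Brown correction to estimate the
    reliability of the full-length test."""
    first = [sum(r[0::2]) for r in responses]
    second = [sum(r[1::2]) for r in responses]
    r_half = pearson(first, second)
    return 2 * r_half / (1 + r_half)
```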

Reliability: Internal consistency. [Slide shows a diagram of the correlations among questions Q1 through Q9] Slide 13

Reliability: Internal consistency.
Kuder-Richardson Formula 20 (KR-20):
- First published in 1937
- Designed for dichotomous (1/0, right/wrong) items
- Values range from 0 to 1 (closer to 1 = higher reliability)
Cronbach's Alpha:
- Published by Cronbach in 1951
- Designed for dichotomous and non-dichotomous (e.g., continuous, 1 to 5) items
- Generally values range from 0 to +1 (closer to +1 = higher reliability)
Questionmark uses Cronbach's Alpha on the Test Analysis Report and in the Results Management System: greater than 0.90 (high; acceptable for high stakes), 0.70 to 0.89 (moderate; acceptable for medium stakes), below 0.70 (low). Slide 14
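
As a sketch, Cronbach's Alpha in a few lines of Python (my own illustration; Questionmark's exact computation may differ in details such as how unanswered items are handled):

```python
def cronbachs_alpha(responses):
    """responses: one list of item scores per participant.
    alpha = k/(k-1) * (1 - sum(item variances) / total-score variance).
    With 1/0 items each item variance reduces to p*q and this is KR-20."""
    k = len(responses[0])

    def var(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = sum(var(col) for col in zip(*responses))   # per-item variances
    total_var = var([sum(r) for r in responses])           # total-score variance
    return (k / (k - 1)) * (1 - item_vars / total_var)
```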

Reliability practicalities: What factors and test characteristics generally influence reliability coefficient values?
- Item difficulty: Items that are extremely hard or extremely easy affect discrimination and therefore reliability. If a large number of participants do not have time to finish the test, this affects item difficulty
- Item discrimination: Items that have higher discrimination values contribute more to the measurement efficacy of the assessment (more discriminating questions = higher reliability)
- Construct being measured: If all questions are measuring the same construct (e.g., are from the same topic), reliability will be increased
- How many participants took the test: With very small numbers of participants the reliability value will be less stable
- How many questions are administered: Generally, the more questions administered, the higher the reliability
Slide 15

Reliability and validity: Validity (of test scores) refers to:
- Whether the test is measuring what it should be measuring
- The processes followed to create the test and test questions
- That experts have had a chance to review and "rubber stamp" the processes
- That the test results predict the intended outcomes
- That the scores are used appropriately
So validity is not a number; it has to do with following best practices, conducting studies and research, using results fairly, etc. In order for an assessment to be valid it must be reliable. Slide 16

Bringing the theory home: Reliability and validity refer to the quality of tests and test items, which translates into the quality of test scores. Conducting analyses (analytics) on the test and test items will determine how well the questions are performing and how well the participants understood the material. Understanding how to use the analytic tools at your disposal will help ensure high quality tests. Let's get into some specifics. Slide 17

Providing meaningful scores to participants: One of the most important aspects of the assessment process has to do with providing meaningful scores to participants. Depending on the stakes/purpose of your assessment program, appropriate feedback to stakeholders (one of which is generally the participant) is crucial. Many here are likely familiar with the Score List Report, Coaching Report, and Transcript Report, so we won't spend time on these here today. Slide 18

Score List Report. Slide 19

Coaching Report. Slide 20

Transcript Report. Slide 21

Assessment and item quality: The scores that are reported to participants and other stakeholders must be derived from high quality questions. In order for the participant to obtain meaning and achieve learning from assessment results, the assessments must measure what they are supposed to measure, reliably. Two core aspects of question quality have to do with the analysis of difficulty and discrimination. Slide 22

Item difficulty and discrimination.
Item difficulty:
- P-value: The proportion of participants selecting the correct response, or the raw question score converted to a proportion
- For a true/false (scored 1/0) question where true is the right answer, a p-value of 0.650 means 65.0% of participants selected True
- For a 0-5 question, if the mean on the question is 3.75/5, the p-value is 0.750 (3.75/5 = 0.750); the average score on the question is 75%
Item discrimination:
- Point-biserial correlation (item-total correlation): The correlation between the question scores and the overall assessment scores for all participants
- Outcome discrimination: The upper group's performance minus the lower group's performance
Slide 23
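
A minimal sketch of both statistics (illustrative code, not Questionmark's implementation); the last line reproduces the 0-5 example above.

```python
def p_value(item_scores, max_score=1):
    """Item difficulty: mean item score as a proportion of the maximum score."""
    return sum(item_scores) / (len(item_scores) * max_score)

def point_biserial(item_scores, test_scores):
    """Item-total correlation between one item's scores and total test scores."""
    n = len(item_scores)
    mi, mt = sum(item_scores) / n, sum(test_scores) / n
    cov = sum((i - mi) * (t - mt) for i, t in zip(item_scores, test_scores))
    si = sum((i - mi) ** 2 for i in item_scores) ** 0.5
    st = sum((t - mt) ** 2 for t in test_scores) ** 0.5
    return cov / (si * st)

# A question mean of 3.75 out of 5 gives a p-value of 0.750, as on the slide.
print(p_value([5, 3, 4, 3], max_score=5))  # 0.75
```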

Item difficulty: How a participant responds to a question says something about what they know and can do. Question difficulty has to do with both the question and the participant: "this question is easy because lots of participants selected the correct answer"; "this participant got this question right, and so they have demonstrated knowledge/skills at this level." Question difficulty is on a scale that is related to participant ability (knowledge/skills). Slide 24

Item discrimination: Discrimination refers to how well an item discriminates/differentiates between participants of different knowledge/skill levels. Experts in an area should get higher scores on the question and higher scores on the overall assessment; novices in the same area should get lower scores on the question and lower scores on the overall assessment. Slide 25

Using the Item Analysis Report: Composed of several sections:
- Information section
- Item difficulty (p-value) histogram and item discrimination (outcome discrimination) histogram
- Question by question detailed analysis
- Summary information
The information section provides details regarding when the report was created, etc. Slide 26

Using the Item Analysis Report: The summary information (at the bottom of the report) provides a summary of the average p-value, discrimination, and item-total correlation. Slide 27

Using the Item Analysis Report: The item difficulty (p-value) histogram and item discrimination (outcome discrimination) histogram provide a summary of the number of items by difficulty and by discrimination. [Slide annotations: some items harder, most items in the average difficulty range, some easier; some items with worse discrimination, most items with good discrimination, some better] Slide 28

Using the Item Analysis Report: The question by question detailed analysis provides a detailed analysis of each question, including question difficulty and point-biserial correlation. [Slide annotations: The higher the better. Too hard? Ran out of time? Not enough high scorers, too many low? A lot of people thought these were the correct answers: is this being taught properly, or are there item wording problems?] Slide 29

Using the Item Analysis Report. [Slide annotations: A hard question. The correlation reflects the high/low split. Lower number. Lots of high, no low (great!). The alternatives are pulling more of the low group than the high group; all are pulling some people] Slide 30

Using the Test Analysis Report: Composed of several sections:
- Information section
- Table of test statistics
- Topic level statistical breakdown
- Frequency distribution
- Histogram
The information section provides details regarding when the report was created, etc. [Slide annotation: Remember, reducing sample size reduces measurement precision] Slide 31

Using the Test Analysis Report: The table of test statistics and the topic level statistical breakdown provide the statistical details for the overall assessment as well as at the topic level. Slide 32

Skew: A measure of the symmetry of the distribution of scores (i.e., whether scores are pushed or skewed to one side or the other). Ranges from about -2 to +2. [Slide shows three histograms of participant scores: negative skew, normal distribution (no skew), and positive skew] Slide 33
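
A sketch of one common sample skewness estimator (the adjusted Fisher-Pearson form; the slide does not specify which estimator Questionmark uses):

```python
def skewness(scores):
    """Adjusted Fisher-Pearson sample skewness: negative = tail toward low
    scores (mass piled at the high end), positive = tail toward high scores."""
    n = len(scores)
    m = sum(scores) / n
    s = (sum((x - m) ** 2 for x in scores) / (n - 1)) ** 0.5
    return (n / ((n - 1) * (n - 2))) * sum(((x - m) / s) ** 3 for x in scores)
```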

Negative Skew. [Histogram: a negatively skewed distribution of participant scores] Slide 34

No Skew. [Histogram: a normal distribution of participant scores (no skew)] Slide 35

Positive Skew. [Histogram: a positively skewed distribution of participant scores] Slide 36

Kurtosis: A measure of how peaked/pointed versus flat the distribution of scores is (i.e., what is happening at the tails). Normal range from about -3 to +3. It is important to memorize this to impress your friends:

$$g_2 = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} z_i^4 \;-\; \frac{3(n-1)^2}{(n-2)(n-3)}, \qquad z_i = \frac{x_i - \bar{x}}{s}$$

[Slide shows three histograms of participant scores: negative kurtosis (flat: platykurtic), normal distribution (zero kurtosis: mesokurtic), and positive kurtosis (pointed: leptokurtic)] Slide 37
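
The same formula as a runnable sketch (illustrative code):

```python
def excess_kurtosis(scores):
    """Sample excess kurtosis per the formula above: 0 for a normal
    distribution, positive = leptokurtic (pointed), negative = platykurtic (flat)."""
    n = len(scores)
    m = sum(scores) / n
    s = (sum((x - m) ** 2 for x in scores) / (n - 1)) ** 0.5
    z4 = sum(((x - m) / s) ** 4 for x in scores)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))) * z4 \
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
```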

Positive (pointed) Kurtosis. [Histogram: a leptokurtic distribution of participant scores] Slide 38

Negative (nearly flat) Kurtosis. [Histogram: a platykurtic distribution of participant scores] Slide 39

Zero (normal) Kurtosis. [Histogram: a mesokurtic (normal) distribution of participant scores] Slide 40

Mean (arithmetic): The most commonly used measure of central tendency (central tendency refers to the middle of a distribution of scores). Range of values depends on the scores. Slide 41

Median: Another measure of central tendency, less sensitive than the mean to outliers. The score at which 50% of participants obtained higher scores and 50% of participants obtained lower scores. Range of values depends on the scores. Slide 42

Mode: A third measure of central tendency, used a great deal in survey analysis. The most common score in a distribution of scores. Range of values depends on the scores. Example: 34%, 43%, 56%, 56%, 56%, 63%, 67%, 76%, 88%; mode = 56%. Slide 43
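
Using the mode example above, all three measures of central tendency in a few lines of standard-library Python (a sketch):

```python
import statistics

scores = [34, 43, 56, 56, 56, 63, 67, 76, 88]  # the slide's example (percent)
print(statistics.mean(scores))    # about 59.9 (arithmetic mean)
print(statistics.median(scores))  # 56 (middle score)
print(statistics.mode(scores))    # 56 (most common score, as on the slide)
```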

Standard deviation: The spread or variation of test scores between participants. Are the scores spread out (e.g., 0 to 100%) or clustered together (e.g., all scores between 55% and 62%)? Range of values depends on the scores. [Slide annotation: a participant's score minus the mean gives a sense of the spread/variation] Example: Rick's test score = 75%, Sally's test score = 83%, Mark's test score = 53%, Ella's test score = 91%; standard deviation = 16.36%. Slide 44

Variance: Another measure of variation, and the first step in calculating a standard deviation: the standard deviation is the square root of the variance. Range of values depends on the scores. Used in some advanced calculations (e.g., analysis of variance: ANOVA; multivariate analysis of variance: MANOVA). Slide 45
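
A quick check of the standard deviation example above, using the sample (n-1) formulas (a sketch):

```python
import statistics

scores = [75, 83, 53, 91]                 # Rick, Sally, Mark, Ella (percent)
variance = statistics.variance(scores)    # sample variance, about 267.67
sd = statistics.stdev(scores)             # standard deviation = sqrt(variance)
print(f"variance = {variance:.2f}, sd = {sd:.2f}")  # sd = 16.36, as on the slide
```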

Standard error of measurement: The "spread" or standard deviation of test scores for a participant if that participant had (theoretically) been assessed repeatedly using the same test. Refers to the inherent error surrounding any observed test score: observed test score = theoretical true score + error. Related to test reliability: the more reliable the test, the lower the standard error; the amount of error on a test is inversely related to reliability. The range is dependent on the size of the standard deviation (which is dependent on the scores) and on the magnitude of the test reliability coefficient. [Slide annotations: Typical range: 1 to 20. Hey, what's this? It's reliability!] Slide 46
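
The usual classical formula, as a sketch (the SD and reliability values below are assumed for illustration; the slide does not give a worked example at this point):

```python
def sem(sd, reliability):
    """Classical standard error of measurement: SEM = SD * sqrt(1 - reliability).
    The more reliable the test, the lower the SEM."""
    return sd * (1 - reliability) ** 0.5

# Assumed values: an SD of 15% and an Alpha of 0.90 give an SEM of about 4.7%.
print(sem(15.0, 0.90))
```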

Standard error of measurement: Product knowledge test example. Rick's observed score = 66.1%. Theoretical test scores: 65.2%, 66.4%, 63.7%, 67.1%, 65.8%, 67.5%, 65.9%. Theoretical standard deviation = 1.26%, i.e., 1.26% of error surrounding Rick's observed score. Slide 47

Standard error of the mean: Conceptually very similar to the standard error of measurement, but rather than referring to error in an individual participant's score, this refers to how much error there is in determining the true population mean. The larger the sample size (i.e., the number of participants), the smaller the standard error of the mean: the more participants in a sample, the greater the likelihood that it approximates the population. [Slide annotation: Hey, what's this? It's the number of participants who took the test! The more participants, the closer we get to the population] Slide 48

Standard error of the mean: [Slide shows two histograms of participant scores: a sample of 153 participants and a population of 87,000 participants] Sample mean = 56.78%, sample standard deviation = 15.21%, standard error of the mean = 1.23%. The true population mean will reside within plus or minus 1 standard error of the sample mean 68 times out of 100. Typically the population information is not known, but if you could see it: population true mean = 57.81%. Slide 49
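
A sketch reproducing the slide's sample figures:

```python
def standard_error_of_mean(sd, n):
    """The SE of the mean shrinks with the square root of the sample size."""
    return sd / n ** 0.5

# SD = 15.21%, n = 153 participants, as on the slide.
print(standard_error_of_mean(15.21, 153))  # about 1.23
```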

Using the Test Analysis Report: The table of test statistics and the topic level statistical breakdown provide the statistical details for the overall assessment as well as at the topic level. [Slide annotations: The more the better; why? Because it is about internal consistency. The lower the better (except skew: the closer to 0 the better). The higher the better] Slide 50

Using the Test Analysis Report: The frequency distribution and histogram display the assessment results in tabular form and graphically. [Slide annotations: The middle line is the median; most scores are between the 25th and 75th percentiles; some higher scores, some lower scores; data displayed graphically] Slide 51

The Results Management System (RMS): The Item Analysis Report and Test Analysis Report produce a snapshot of information in a static form. What if you need or want to drop questions, change question scoring, and see dynamically what the effects of making changes to your assessment results would be? Welcome to the RMS. Slide 52

The Results Management System (RMS): A new product, an add-on to Questionmark Perception:
- Review items in a test and drop, credit, or alter scoring
- Review test results and define a pass or cut score
- Get a real-time preview of how proposed changes will impact overall item statistics and test reliability
- Publish results into a flat file database for access by reporting tools
- Maintain changes within an audit trail to aid assessment defensibility
Slide 53

Results Management System. [Architecture diagram labels: Working Storage; RMS Reporting; Published Results; 3rd Party Reporting Tools; Portfolios; Data Warehouse; Imports; Results Management; Published RMS Reports; HR Database; Questionmark Perception Reports; Assessment Results; Other Database; Assessment Management System] Slide 54

[Screenshot annotations: Drop or credit questions; review item difficulty; edit Angoff estimate; borderline item discrimination flagged; low item discrimination flagged; real-time summary; Distribute, Calculate, Save] Slide 55

Results Management System demonstration (why talk when we can show?). Slide 56

Resources that can help:
- Test Analysis Report guide: http://www.questionmark.com/perception/help/v4/manuals/er/report_types/test_analysis.htm
- RMS user guide: http://www.questionmark.com/us/whitepapers/index.aspx
- RMS and other white papers: http://www.questionmark.com/us/whitepapers/index.aspx
- Training sessions: Creating Assessments That Get Results (http://www.questionmark.com/us/training/)
Slide 57

Summary: Understanding some of the theory can help determine why questions are or are not performing well. Questionmark tools, such as the Item Analysis Report, Test Analysis Report, and Results Management System, provide the mechanisms to put theory into practice and get the most out of your assessments. Applying (as much as possible) medium/high stakes standards to low stakes assessments will improve the information gleaned from assessments for all stakeholders. Slide 58

Closing: Thank you very much for your time and interest. Any questions? Slide 59