Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments
Greg Pope, Analytics and Psychometrics Manager
2008 Users Conference, San Antonio
Introduction and purpose of this session
This session will present some of the theory behind assessment analysis to put the numbers into context. The discussion will be as non-technical as possible, with a more applied approach. Questionmark tools that can help evaluate the performance of assessments and assessment items will be presented, with applied examples. If you have questions, please do ask during the session. Slide 2
Agenda
- A brief review of the theory
- Putting theory into practice: some Questionmark tools
  - Score List Report, Coaching Report, Transcript Report
  - Item Analysis Report
  - Test Analysis Report
  - Results Management System (RMS)
- Summary
- Question and answer period
Slide 3
CTT and IRT: what's the diff?
CTT (Classical Test Theory) is what we all know and love:
- P-values
- Discrimination statistics (point-biserial correlations, high-minus-low performance, etc.)
- Been around a long time (most of the 20th century)
- Works very well for most applications; by far the most widely used form of item/test analysis
- Works with smaller sample sizes (e.g., 150-200 or less)
- Relatively simple to compute (no fitting of data to a model)
- Has a different set of assumptions from IRT
Slide 4
CTT and IRT: what's the diff?
IRT (Item Response Theory) is an alternative that some of us may have heard of or use:
- a-parameter: item discrimination
- b-parameter: item difficulty
- c-parameter: item pseudo-guessing
- Been around since the 1960s (Lord)
- Makes things like computer adaptive testing (CAT) and advanced test development techniques possible
- More complex to compute (fitting data to a model)
- Requires larger sample sizes depending on the number of parameters: the more parameters, the more participant responses needed (e.g., 700+ for 3-PL)
- More information: http://edres.org/irt/
Slide 5
CTT and IRT
[Figure: side-by-side comparison of a CTT item analysis and an IRT item characteristic curve (ICC)]
Slide 6
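The deck shows an ICC as a picture only. As a minimal illustrative sketch (not a Questionmark tool), the standard 3-PL model behind such a curve can be written in a few lines of Python; the parameter values below are hypothetical:

```python
from math import exp

def icc_3pl(theta, a, b, c, D=1.7):
    """Item characteristic curve under the 3-parameter logistic (3-PL) model.
    theta = participant ability; a = discrimination; b = difficulty;
    c = pseudo-guessing (lower asymptote); D = conventional scaling constant."""
    return c + (1.0 - c) / (1.0 + exp(-D * a * (theta - b)))

# A hypothetical item: average difficulty (b=0), some guessing (c=0.2).
# Probability of a correct response rises with ability:
for theta in (-2.0, 0.0, 2.0):
    print(round(icc_3pl(theta, a=1.0, b=0.0, c=0.2), 3))  # prints 0.226, 0.6, 0.974
```

Note how the curve never drops below c = 0.2: even very low-ability participants can guess the item correctly.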
CTT and IRT
What is better, CTT or IRT? Each is used for its own purposes, and each has pros and cons. Questionmark currently uses CTT in its products:
- Flexible in terms of sample sizes
- Fast to compute, with few or no computational gotchas
- People are familiar with these statistics, so there is no need to learn a new measurement model
- CTT meets the needs of 99% or more of customers
CTT statistics are related to IRT statistics to some degree:
- P-values are highly correlated with b-values
- Point-biserial correlations are highly correlated with a-values
We will be discussing CTT today. Slide 7
Reliability
Reliability is used in everyday language: "my car runs reliably" means it starts every time. We are going to be talking about test score reliability. Essentially: how consistently the test scores measure a construct. We can't go into all the detail here today; for a good primer on the theory see:
Traub, R.E. (1994). Reliability for the Social Sciences: Theory and Applications. Thousand Oaks: Sage. Slide 8
Reliability (briefly, the theory)
An assessment is a measurement instrument made up of many individual measurements (questions/items). What is being measured is the ability, trait, construct, or latent variable of interest (a massage therapy certification exam may measure massage knowledge/skills; an investment banking test may measure the construct "knowledge of investment banking"). All measurement instruments have error in their estimates, so the traditional view of test score reliability says that a person's:
observed score = theoretical true score + error
Slide 9
Measurements and error
Measurements made by a thermometer are imperfect (atmospheric variables, sunlight, etc.). To mitigate this, take lots of measurements using different, high-quality thermometers.
[Figure: eight thermometer readings clustered around 78 degrees: 78.2, 78.9, 77.5, 78.1, 78.7, 78.0, 77.9, 78.4]
Slide 10
Measurements and error
Measurements made by a test question of a construct are imperfect (participant fatigue, psychological variables, etc.). To mitigate this, take lots of measurements using different, high-quality questions.
[Figure: nine questions, Q1 through Q9, each contributing a measurement of the construct]
Slide 11
Reliability
Four approaches for measuring reliability:
1. Internal consistency: correlations of the items comprising the test (how well do they "hang together")
2. Split-half (split forms): correlation of two forms (splits) of the test (e.g., first 25 items versus last 25)
3. Test-retest: correlation between multiple administrations of the same test
4. Inter-rater reliability: correlation between two or more raters (markers) who rate the same thing (e.g., provide essay scores)
Slide 12
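As an illustrative sketch of approach 2 (the deck itself shows no code), a split-half estimate correlates the two half-test scores and then applies the usual Spearman-Brown correction to estimate full-length reliability; the correction step is standard practice, though the slide does not mention it by name:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two score lists of equal length."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def split_half_reliability(first_half_scores, second_half_scores):
    """Correlate the two halves, then apply the Spearman-Brown correction
    to estimate the reliability of the full-length test."""
    r = pearson(first_half_scores, second_half_scores)
    return 2 * r / (1 + r)

# Hypothetical half-test scores for four participants
print(round(split_half_reliability([20, 18, 15, 10], [19, 17, 14, 11]), 3))
```

Because each half contains only half the items, the raw half-to-half correlation understates reliability; Spearman-Brown projects it back up to the full test length.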
Reliability: Internal consistency
[Figure: items Q1 through Q9 interconnected, illustrating the inter-item correlations]
Slide 13
Reliability: Internal consistency
Kuder-Richardson Formula 20 (KR-20):
- First published in 1937
- Designed for dichotomous (1/0, right/wrong) items
- Values range from 0 to 1 (closer to 1 = higher reliability)
Cronbach's Alpha:
- Published by Cronbach in 1951
- Designed for dichotomous and non-dichotomous (continuous, e.g., 1 to 5) items
- Generally values range from 0 to +1 (closer to +1 = higher reliability)
Questionmark uses Cronbach's Alpha on the Test Analysis Report and in the Results Management System:
- Greater than 0.90: high (acceptable for high stakes)
- 0.70 to 0.89: moderate (acceptable for medium stakes)
- Below 0.70: low
Slide 14
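As a quick illustration of the standard formula (a minimal Python sketch, not how Questionmark computes it internally), Cronbach's Alpha can be calculated from an item-by-participant score matrix; with dichotomous (1/0) items it is equivalent to KR-20. The data below are made up:

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """Cronbach's Alpha from an item-by-participant score matrix.
    item_scores[i][p] = score of participant p on item i.
    With dichotomous (1/0) items this is equivalent to KR-20."""
    k = len(item_scores)
    sum_item_vars = sum(pvariance(item) for item in item_scores)
    # Total test score per participant, summed across items
    totals = [sum(per_item) for per_item in zip(*item_scores)]
    return (k / (k - 1)) * (1 - sum_item_vars / pvariance(totals))

# Hypothetical data: 3 dichotomous items, 4 participants
items = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
]
print(cronbach_alpha(items))  # 0.75
```

On the deck's scale, an alpha of 0.75 would be "moderate": acceptable for medium stakes, but not for high stakes.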
Reliability practicalities
What factors and test characteristics generally influence reliability coefficient values?
- Item difficulty: items that are extremely hard or extremely easy affect discrimination and therefore reliability. If a large number of participants do not have time to finish the test, this affects item difficulty.
- Item discrimination: items with higher discrimination values contribute more to the measurement efficacy of the assessment (more discriminating questions = higher reliability).
- Construct being measured: if all questions measure the same construct (e.g., come from the same topic), reliability will be increased.
- Number of participants: with very small numbers of participants, the reliability value will be less stable.
- Number of questions: generally, the more questions administered, the higher the reliability.
Slide 15
Reliability and validity
Validity (of test scores) refers to:
- Whether the test is measuring what it should be measuring
- The processes followed to create the test and test questions
- Whether experts have had a chance to review and sign off on the processes
- Whether the test results predict the intended outcomes
- Whether the scores are used appropriately
So validity is not a number: it has to do with following best practices, conducting studies and research, using results fairly, etc. In order for an assessment to be valid, it must be reliable. Slide 16
Bringing the theory home
Reliability and validity refer to the quality of tests and test items, which translates into the quality of test scores. Conducting analyses (analytics) on the test and test items will determine how well the questions are performing and how well the participants understood the material. Understanding how to use the analytic tools at your disposal will help ensure high-quality tests. Let's get into some specifics. Slide 17
Providing meaningful scores to participants
One of the most important aspects of the assessment process has to do with providing meaningful scores to participants. Depending on the stakes/purpose of your assessment program, appropriate feedback to stakeholders (one of which is generally the participant) is crucial. Many here are likely familiar with the Score List Report, Coaching Report, and Transcript Report, so we won't spend time on these today. Slide 18
Score List Report
[Screenshot]
Slide 19
Coaching Report
[Screenshot]
Slide 20
Transcript Report
[Screenshot]
Slide 21
Assessment and item quality
The scores reported to participants and other stakeholders must be derived from high-quality questions. In order for the participant to obtain meaning and achieve learning from assessment results, the assessments must measure what they are supposed to measure, reliably. Two core aspects of question quality are the analysis of difficulty and the analysis of discrimination. Slide 22
Item difficulty and discrimination
Item difficulty:
- P-value: the proportion of participants selecting the correct response, or the raw score converted to a percentage
- For a true/false (scored 1/0) question where true is the right answer, a p-value of 0.650 means 65.0% of participants selected True
- For a 0-5 question, if the mean score on the question is 3.75/5 then the p-value is 0.750 (3.75/5 = 0.750); the average score on the question is 75%
Item discrimination:
- Point-biserial correlation (item-total correlation): the correlation between the question scores and overall assessment scores for all participants
- Outcome discrimination: the upper group minus the lower group
Slide 23
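Both statistics on this slide are simple enough to sketch in a few lines of Python (an illustration of the standard formulas, not Questionmark's implementation); the five-participant example at the end is hypothetical:

```python
from math import sqrt

def p_value(item_scores, max_score=1.0):
    """Item difficulty: mean item score as a proportion of the maximum score."""
    return sum(item_scores) / (len(item_scores) * max_score)

def point_biserial(item_scores, total_scores):
    """Item-total correlation: Pearson correlation between each participant's
    score on this question and their overall assessment score."""
    n = len(item_scores)
    mx = sum(item_scores) / n
    my = sum(total_scores) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(item_scores, total_scores))
    sx = sqrt(sum((x - mx) ** 2 for x in item_scores))
    sy = sqrt(sum((y - my) ** 2 for y in total_scores))
    return cov / (sx * sy)

# The slide's examples: 65% selected True; a 0-5 item with a mean of 3.75
print(p_value([1] * 65 + [0] * 35))   # 0.65
print(p_value([3.75], max_score=5))   # 0.75
# Five hypothetical participants: item right/wrong vs. overall test score
print(round(point_biserial([1, 1, 1, 0, 0], [90, 80, 70, 60, 50]), 3))  # 0.866
```

In the last line the three highest scorers got the item right and the two lowest got it wrong, so the item discriminates well and the point-biserial is high.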
Item difficulty
How a participant responds to a question says something about what they know and can do. Question difficulty has to do with both the question and the participant:
- "This question is easy because lots of participants selected the correct answer."
- "This participant got this question right, so they have demonstrated knowledge/skills at this level."
Question difficulty is on a scale that is related to participant ability (knowledge/skills). Slide 24
Item discrimination
Discrimination refers to how well an item discriminates/differentiates between participants of different knowledge/skill levels:
- Experts in an area should get higher scores on the question and higher scores on the overall assessment
- Novices in the same area should get lower scores on the question and lower scores on the overall assessment
Slide 25
Using the Item Analysis Report
Composed of several sections:
- Information section
- Item difficulty (p-value) histogram and item discrimination (outcome discrimination) histogram
- Question-by-question detailed analysis
- Summary information
The information section provides details regarding when the report was created, etc. Slide 26
Using the Item Analysis Report
Summary information (at the bottom of the report) provides a summary of the average p-value, discrimination, and item-total correlation. Slide 27
Using the Item Analysis Report
Item difficulty (p-value) histogram and item discrimination (outcome discrimination) histogram: provide a summary of the number of items by difficulty and by discrimination.
[Annotated histograms: most items fall in the average difficulty range, with some harder and some easier; most items have good discrimination, with some worse and some better]
Slide 28
Using the Item Analysis Report
Question-by-question detailed analysis: provides a detailed analysis of each question, including the question difficulty and the point-biserial correlation (the higher, the better).
[Annotated example prompting questions to ask of a flagged item: Is the question too hard? Did participants run out of time? Did too few high scorers and too many low scorers get it right? If a lot of people thought certain wrong alternatives were the correct answers, is the material being taught properly, or are there item wording problems?]
Slide 29
Using the Item Analysis Report
[Annotated example of a hard question: a lower number of participants answered correctly; the correlation reflects the high/low split; lots of high scorers and no low scorers chose the correct answer (great!); the alternatives are pulling more of the low group than the high group, though all are pulling some people]
Slide 30
Using the Test Analysis Report
Composed of several sections:
- Information section
- Table of test statistics
- Topic-level statistical breakdown
- Frequency distribution
- Histogram
The information section provides details regarding when the report was created, etc. (Remember: reducing sample size reduces measurement precision.) Slide 31
Using the Test Analysis Report
Table of test statistics and topic-level statistical breakdown: provide the statistical details for the overall assessment as well as at the topic level. Slide 32
Skew
A measure of the symmetry of the distribution of scores (i.e., whether scores are pushed, or skewed, to one side or the other). Ranges from about -2 to +2.
[Figure: three score distributions showing negative skew, a normal distribution (no skew), and positive skew]
Slide 33
Negative Skew
[Figure: score distribution with negative skew; scores cluster at the high end with a tail toward the low scores]
Slide 34
No Skew
[Figure: symmetric, normal distribution of scores (no skew)]
Slide 35
Positive Skew
[Figure: score distribution with positive skew; scores cluster at the low end with a tail toward the high scores]
Slide 36
Kurtosis
A measure of the peakedness of a distribution of scores (i.e., how peaked/pointed versus flat the distribution of scores is, and what is happening at the tails). Normal range from about -3 to +3. For n scores with standardized values z:
kurtosis = [n(n+1) / ((n-1)(n-2)(n-3))] * sum(z^4) - 3(n-1)^2 / ((n-2)(n-3))
(It is important to memorize this to impress your friends.)
[Figure: three score distributions showing negative kurtosis (flat: platykurtic), a normal distribution (zero kurtosis: mesokurtic), and positive kurtosis (pointed: leptokurtic)]
Slide 37
Positive (pointed) Kurtosis
[Figure: leptokurtic score distribution with a pronounced peak]
Slide 38
Negative (nearly flat) Kurtosis
[Figure: platykurtic score distribution, nearly flat]
Slide 39
Zero (normal) Kurtosis
[Figure: mesokurtic, normally distributed scores]
Slide 40
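The skew and kurtosis statistics from the last few slides can be sketched directly from their standard sample formulas (an illustration only; the example data are made up, not from a Questionmark report):

```python
from statistics import mean, stdev

def skewness(scores):
    """Sample skewness: negative = tail toward low scores,
    positive = tail toward high scores, 0 = symmetric."""
    n, m, s = len(scores), mean(scores), stdev(scores)
    return (n / ((n - 1) * (n - 2))) * sum(((x - m) / s) ** 3 for x in scores)

def kurtosis(scores):
    """Sample excess kurtosis: negative = flat (platykurtic),
    0 = normal (mesokurtic), positive = peaked (leptokurtic)."""
    n, m, s = len(scores), mean(scores), stdev(scores)
    z4 = sum(((x - m) / s) ** 4 for x in scores)
    return (n * (n + 1)) / ((n - 1) * (n - 2) * (n - 3)) * z4 \
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))

print(skewness([1, 2, 3, 4, 5]))            # 0.0 (symmetric scores)
print(skewness([1, 9, 10, 10, 10]) < 0)     # True (tail toward the low scores)
print(round(kurtosis([1, 2, 3, 4, 5]), 2))  # -1.2 (flatter than normal)
```

The kurtosis function is the same formula shown on the kurtosis slide, with z being each score standardized by the mean and standard deviation.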
Mean (arithmetic)
The most commonly used measure of central tendency ("central tendency" refers to the middle of a distribution of scores). Range of values depends on scores. Slide 41
Median
Another measure of central tendency, less sensitive than the mean to outliers: the point where 50% of participants obtained higher scores and 50% of participants obtained lower scores. Range of values depends on scores. Slide 42
Mode
A third measure of central tendency, used a great deal in survey analysis: the most common score in a distribution of scores. Range of values depends on scores.
Example: 34% 43% 56% 56% 56% 63% 67% 76% 88%
Mode = 56%
Slide 43
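All three central-tendency measures from these slides are one-liners in Python's standard library; here they are applied to the slide's own example scores:

```python
from statistics import mean, median, mode

# The example scores from the mode slide, in %
scores = [34, 43, 56, 56, 56, 63, 67, 76, 88]
print(round(mean(scores), 2))  # 59.89 (arithmetic mean)
print(median(scores))          # 56 (half scored higher, half lower)
print(mode(scores))            # 56 (the most common score, matching the slide)
```

Note that the median and mode coincide here by chance; on a skewed distribution the three measures typically drift apart.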
Standard deviation
The spread or variation of test scores between participants: are the scores spread out (e.g., 0 to 100%) or clustered together (e.g., all scores between 55% and 62%)? Each participant's score minus the mean gives a sense of the spread/variation. Range of values depends on scores.
Example: Rick's test score = 75%, Sally's test score = 83%, Mark's test score = 53%, Ella's test score = 91%. Standard deviation = 16.36%.
Slide 44
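The slide's four-participant example can be checked directly with the standard library (this reproduces the slide's 16.36% using the sample standard deviation):

```python
from statistics import mean, stdev

# The four scores from the slide: Rick, Sally, Mark, Ella (in %)
scores = [75, 83, 53, 91]
print(mean(scores))             # 75.5
print(round(stdev(scores), 2))  # 16.36, matching the slide's standard deviation
```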
Variance
Another measure of variation, and the first step in calculating a standard deviation: the standard deviation is the square root of the variance. Range of values depends on scores. Used in some advanced calculations (e.g., analysis of variance: ANOVA; multiple analysis of variance: MANOVA). Slide 45
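The square-root relationship stated on this slide can be verified on the same four scores used for the standard deviation example:

```python
from statistics import stdev, variance

scores = [75, 83, 53, 91]  # the same four scores as on the standard deviation slide
print(round(variance(scores), 2))    # 267.67 (sample variance)
print(round(stdev(scores) ** 2, 2))  # 267.67: the SD is the square root of the variance
```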
Standard Error of Measurement
The "spread" (standard deviation) of test scores for a participant if that participant had theoretically been assessed repeatedly using the same test. Refers to the inherent error surrounding any observed test score:
observed test score = theoretical true score + error
Related to test reliability: the more reliable the test, the lower the standard error (the amount of error on a test is inversely related to reliability). The range depends on the size of the standard deviation (which depends on the scores) and the magnitude of the test reliability coefficient; a typical range is 1 to 20. Slide 46
Standard Error of Measurement
Example (product knowledge test): Rick's observed score = 66.1%. Theoretical repeated test scores: 65.2%, 66.4%, 63.7%, 67.1%, 65.8%, 67.5%, 65.9%. The theoretical standard deviation of these scores = 1.26%, so there is 1.26% of error surrounding Rick's observed score. Slide 47
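In practice we cannot retest Rick repeatedly, so the SEM is estimated from the test's standard deviation and reliability. A minimal sketch of the standard formula follows; the SD of 15% and alpha of 0.91 are assumed values for illustration, not from the slide:

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability): the more reliable the test,
    the smaller the error band around an observed score."""
    return sd * sqrt(1.0 - reliability)

# Hypothetical test: score SD of 15% and Cronbach's Alpha of 0.91
sem = standard_error_of_measurement(15.0, 0.91)
print(round(sem, 1))  # 4.5
# Roughly 68% of the time the true score lies within +/- 1 SEM of the
# observed score, e.g. an observed 66.1% gives a band of about 61.6% to 70.6%
```

Notice the inverse relationship the slide describes: at a perfect reliability of 1.0 the SEM collapses to zero.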
Standard Error of the Mean
Conceptually very similar to the standard error of measurement, but rather than referring to error in an individual participant's score, this refers to how much error there is in determining the true population mean. The larger the sample size (i.e., the number of participants who took the test), the smaller the standard error of the mean: the more participants in a sample, the greater the likelihood that it approximates the population. Slide 48
Standard Error of the Mean
Example: a sample of 153 participants versus a population of 87,000 participants.
Sample mean = 56.78%, sample standard deviation = 15.21%, standard error of the mean = 1.23%. The true population mean will reside within plus or minus 1 standard error of the sample mean 68 times out of 100. Typically the population information is not known, but if you could see it: population true mean = 57.81%.
[Figure: histograms of the sample and population score distributions]
Slide 49
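The slide's 1.23% can be reproduced from the standard formula SE = SD / sqrt(n) (a sketch for illustration, not Questionmark's code):

```python
from math import sqrt

def standard_error_of_mean(sd, n):
    """SE of the mean = SD / sqrt(n): the more participants, the smaller the error."""
    return sd / sqrt(n)

# The slide's sample: 153 participants, SD = 15.21%
se = standard_error_of_mean(15.21, 153)
print(round(se, 2))  # 1.23, matching the report
# ~68% chance the true population mean lies within +/- 1 SE of the sample mean:
print(round(56.78 - se, 2), round(56.78 + se, 2))  # 55.55 58.01
```

The slide's (normally unknowable) population mean of 57.81% does indeed fall inside that 55.55-58.01 band.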
Using the Test Analysis Report
Table of test statistics and topic-level statistical breakdown: provides the statistical details for the overall assessment as well as at the topic level.
[Annotations on the statistics: reliability, the higher the better (why? because it is about internal consistency); error statistics, the lower the better; skew, the closer to 0 the better; numbers of items and participants, the more the better]
Slide 50
Using the Test Analysis Report
Frequency distribution and histogram: display the assessment results in tabular form and graphically.
[Annotations: the middle line is the median; most scores fall between the 25th and 75th percentiles, with some higher scores and some lower scores]
Slide 51
The Results Management System (RMS)
The Item Analysis Report and Test Analysis Report produce a snapshot of information in a static form. What if you need or want to drop questions or change question scoring, and see dynamically what the effects of those changes on your assessment results would be? Welcome to the RMS. Slide 52
The Results Management System (RMS)
A new product, an add-on to Questionmark Perception:
- Review items in a test and drop, credit, or alter scoring
- Review test results and define a pass (cut) score
- Get a real-time preview of how proposed changes will impact overall item statistics and test reliability
- Publish results into a flat-file database for access by reporting tools
- Maintain changes within an audit trail to aid assessment defensibility
Slide 53
Results Management System
[Architecture diagram: assessment results from Questionmark Perception (the assessment management system) are imported into RMS working storage; results management publishes results for RMS reports and reporting, with published results feeding 3rd-party reporting tools, portfolios, a data warehouse, an HR database, and other databases]
Slide 54
Results Management System
[Screenshot with callouts: drop or credit questions; review item difficulty; edit Angoff estimate; borderline item discrimination flagged; low item discrimination flagged; real-time summary; distribute, calculate, save]
Slide 55
Results Management System
Demonstration (why talk when we can show?) Slide 56
Resources that can help
Test Analysis Report guide: http://www.questionmark.com/perception/help/v4/manuals/er/report_types/test_analysis.htm
RMS user guide: http://www.questionmark.com/us/whitepapers/index.aspx
RMS and other white papers: http://www.questionmark.com/us/whitepapers/index.aspx
Training sessions: Creating Assessments That Get Results (http://www.questionmark.com/us/training/)
Slide 57
Summary
Understanding some of the theory can help determine why questions are or are not performing well. Questionmark tools such as the Item Analysis Report, Test Analysis Report, and Results Management System provide the mechanisms to put theory into practice and get the most out of your assessments. Applying (as much as possible) medium/high-stakes standards to low-stakes assessments will improve the information gleaned from assessments for all stakeholders. Slide 58
Closing
Thank you very much for your time and interest. Any questions? Slide 59