DePaul University INTRODUCTION TO ITEM ANALYSIS: EVALUATING AND IMPROVING MULTIPLE CHOICE QUESTIONS Ivan Hernandez, PhD
OVERVIEW
What is Item Analysis? Overview, Benefits of Item Analysis, Applications
Main Statistics of Item Analysis: Item Difficulty, Item Discrimination, Test Reliability
Implementing Item Analysis: D2L, Excel, SPSS
WHAT IS ITEM ANALYSIS?
WHAT IS ITEM ANALYSIS? Consider the following... Imagine you have a multiple-choice test. Every student sees a collection of questions; each question has different choices, and only one choice is correct. We use the overall score on the exam to assess the student's aptitude/ability. We want to know who understands the material and who doesn't, and we want to make sure that the student's score is stable. Question: How do we know we are assessing the student's ability as well as we can? Answer: Item Analysis
WHAT IS ITEM ANALYSIS? What is Item Analysis? Statistically analyzing your multiple-choice test items, so that you can ensure your items are effectively evaluating student ability. Item Analysis is a collection of techniques - many tools/methods to analyze questions. Examples: Item Difficulty, Item Discrimination, Internal Consistency, Differential Item Functioning
ITEM ANALYSIS PROCESS Item Analysis is an Iterative and Continuous Process: Teach Content Area (consider learning goals; consider student abilities) -> Write Test Items (have items reflect the overall content area; eliminate ambiguous or misleading items) -> Administer Exam (all students should be given sufficient time to complete the exam; testing conditions should be consistent for all students) -> Perform Item Analysis (examine item performance; evaluate reasons for poor performance)
BENEFITS
PURPOSE OF ITEM ANALYSIS Primary questions that can be answered by Item Analysis: 1) Were any of the questions too difficult or easy? (Item Difficulty) 2) How well do the questions separate those students who knew the material from those who did not? The more such questions, the more precisely your exam can measure ability. (Item Discrimination) 3) How consistent are the exam's questions? How stable are people's scores? (Reliability)
BENEFITS OF ITEM ANALYSIS Benefits of using Item Analysis: Test Development - improve test quality: assess ability more efficiently, produce scores that are stable, make the exam more coherent. Precision - quantify the measurement of the assessment: identify areas of improvement, understand the characteristics of the exam, have a clear indication of item quality.
APPLICATIONS
APPLICATIONS OF ITEM ANALYSIS Item Analysis can answer many questions ("Am I able to know who has ability and who does not?", "Am I able to get a fine-grained view of ability?", "How consistent are scores?"). Many fields are concerned with knowing the answers to those questions: Academic exams (GRE, LSAT, ACT); Employee selection (Wonderlic, Situational Judgment Test); Personality assessment (Five Factor Model, Emotional Intelligence)
APPLICATIONS OF ITEM ANALYSIS: ACADEMIC EXAMS Academic exams: the purpose is to assess a student's ability relative to their peers. Item analysis helps construct an exam where: the extremes (high and low scorers) are identifiable; scores are separated (little overlap between students); exam scores are consistent across examinations.
APPLICATIONS OF ITEM ANALYSIS: EMPLOYEE SELECTION Employee selection: the purpose is to select employees who have the highest ability in the workplace. Item analysis helps construct an employee questionnaire where: employees' abilities are distinguished from one another; slightly better employees are distinguishable from slightly worse employees; scores on the employee questionnaire are consistent.
APPLICATIONS OF ITEM ANALYSIS: PERSONALITY ASSESSMENT Personality assessment: the purpose is to evaluate the behavioral tendencies of individuals. Item analysis helps create a questionnaire where: people high on a personality trait and people low on the trait can be detected; people's personality scores are fine-grained; people's personality scores are consistent across evaluations.
PRELIMINARY PREPARATION
BEFORE BEGINNING ITEM ANALYSIS Almost all item analysis procedures require the exam data to be in a specific format: Columns = Exam questions; Rows = An individual examinee; Cells = Whether the examinee got the question right (1) or wrong (0)
BEFORE BEGINNING ITEM ANALYSIS Almost all item analysis procedures require the exam data to be in a specific format: Columns = Exam questions; Rows = An individual examinee; Cells = Whether the examinee got the question right (1) or wrong (0). (Example data matrix: rows for examinees Anne, Bob, Chelsea, Dan, Erica, and Fred; columns for Questions 1-4; each cell holding a 1 or 0.)
ACTIVITY: ITEM DIFFICULTY (EXERCISE)
1) Complete the activity sheet passed out by the presenter
2) The activity sheet contains five students' responses to five different multiple-choice questions
3) Create a data matrix that is formatted for item analysis:
a) Each row is a different student
b) Each column is a different question
c) Each cell indicates whether the student answered the question correctly (1) or incorrectly (0)
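The data layout from the activity can also be sketched in code. This is a minimal illustration in Python (not one of the tools covered later in these slides); the student names and response values here are hypothetical:

```python
# Hypothetical scored responses: one row per student, one column per
# question, each cell 1 (correct) or 0 (incorrect).
matrix = {
    "Student A": [1, 0, 1, 1, 1],
    "Student B": [1, 1, 0, 0, 1],
    "Student C": [0, 1, 1, 0, 0],
    "Student D": [1, 1, 1, 1, 0],
    "Student E": [0, 0, 1, 0, 1],
}

n_students = len(matrix)                          # rows = examinees
n_questions = len(next(iter(matrix.values())))    # columns = questions
totals = {name: sum(row) for name, row in matrix.items()}  # exam scores
```

Every statistic in the rest of the deck (difficulty, discrimination, reliability) is computed from a matrix of exactly this shape.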
ITEM DIFFICULTY
ITEM DIFFICULTY: WHAT IS IT? What is Item Difficulty? Item difficulty is how easy or hard a question is. Examples: If no one got the question right, the item is difficult. If everyone got the question right, the item is considered easy. If half the people got the question right, then the item is somewhere between easy and hard.
ITEM DIFFICULTY: WHY DOES IT MATTER? Why Does Item Difficulty Matter? You want your test to provide information on the full range of people's ability. If you don't pay attention to item difficulty, you don't get a precise measure of ability; attending to it helps ensure the full spectrum of ability is represented. You also want scores to be roughly symmetric: if true ability has a bell-shaped distribution, then your estimated ability should have a bell-shaped distribution.
ITEM DIFFICULTY: WHY DOES IT MATTER? You want your test to provide information on people's ranges of ability. If a question is too easy -> everyone gets the question right -> you don't know who is on the lower end of ability. If a question is too hard -> everyone gets the question wrong -> you don't know who is on the higher end of ability. Therefore, it is important to pay attention to item difficulty to have it be just right. (Example tables: a question every student answers correctly - too easy, cannot tell who has more/less ability; a question every student answers incorrectly - too hard, cannot tell who has more/less ability.)
ITEM DIFFICULTY: WHY DOES IT MATTER? You want scores to be roughly symmetric. Ability tends to follow a roughly normal distribution; if item difficulty is too high or low, then scores will be truncated (prevents symmetry). Therefore, it is important to pay attention to item difficulty to have it be just right. (Figure: example score distributions when difficulty is too high, just right, and too easy.)
ITEM DIFFICULTY: EXAMPLE Imagine a test with THREE questions and FIVE examinees (Alice, Bob, Cindy, Dan, Erin). Everyone got Question 1 correct. Everyone got Question 2 wrong. FOUR people got Question 3 correct.
ITEM DIFFICULTY: EXAMPLE We can total up how many people got each question correct by taking the sum for each question: Question 1 = 5, Question 2 = 0, Question 3 = 4.
ITEM DIFFICULTY: EXAMPLE Divide the total number of people who got the question correct by the total number of people who took the test, and you have the item's difficulty:
Question 1: 5 / 5 = 1.0 (100%)
Question 2: 0 / 5 = 0.0 (0%)
Question 3: 4 / 5 = 0.8 (80%)
ITEM DIFFICULTY: HOW TO CALCULATE IT How to Calculate Item Difficulty: Count how many people answered the question at all = N. Count how many people answered the question correctly = P. Item Difficulty = P / N
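The P / N calculation above is simple enough to sketch directly; this illustrative Python snippet reuses the three questions from the five-examinee example on the previous slides:

```python
def item_difficulty(scores):
    """Item difficulty = P correct / N respondents (scores are 1/0)."""
    return sum(scores) / len(scores)

# The three questions from the five-examinee example:
q1 = [1, 1, 1, 1, 1]   # everyone correct
q2 = [0, 0, 0, 0, 0]   # everyone wrong
q3 = [1, 1, 1, 1, 0]   # four of five correct

print(item_difficulty(q1))  # 1.0 (100%)
print(item_difficulty(q2))  # 0.0 (0%)
print(item_difficulty(q3))  # 0.8 (80%)
```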
ITEM DIFFICULTY: HOW TO INTERPRET IT How to Interpret Item Difficulty: Think of it as "Item Easiness". Ranges from 0% to 100%. Larger values = easier; smaller values = harder. If the value is at an ideal sweet spot, then your test can better separate high-ability people from low-ability people (discussed in the next section). Ideal values depend on how many answer choices there are: two answer choices (true/false) = .75; three answer choices = .67; four answer choices = .63; five answer choices = .60
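The sweet-spot values on this slide follow a common convention: split the difference between the chance score (1 / number of choices) and a perfect score. A sketch of that rule, assuming this is the convention the slide uses:

```python
def ideal_difficulty(n_choices):
    """Midpoint between blind-guessing accuracy and a perfect score."""
    chance = 1.0 / n_choices          # expected accuracy from guessing
    return (chance + 1.0) / 2.0       # halfway to a perfect score

# 2 choices -> 0.75, 3 -> 0.667 (~.67), 4 -> 0.625 (~.63), 5 -> 0.60
```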
ITEM DIFFICULTY: DIAGNOSTICS Low item difficulty can be problematic: it indicates that people, regardless of ability, could not answer the question correctly. If item difficulty is too low / the item is too hard (<.25 or .3): The item may have been miskeyed. The item may be too challenging relative to the overall level of ability of the class. The item may be ambiguous or not written clearly. There may be more than one correct answer.
ITEM DIFFICULTY: IMPROVEMENTS How to Improve Low Item Difficulty (Item is too hard): Make sure all items are keyed correctly. Find a less challenging concept to assess. Improve the class's ability (typically via re-instruction or practice). Find where people are being confused and clarify the question. Make the question more specific.
ITEM DIFFICULTY: PRECAUTIONS Precautions on using Item Difficulty - not meaningful if: Too few respondents (small sample size). The goal is to assess mastery of specific concepts/problems - the recommended values (.6-.75) assume you want to assess people's ability relative to others; if you are concerned with content mastery, you want all items answered correctly. The test had a short time limit ("speed test") - later items seem difficult.
ACTIVITY (EXERCISE)
1) Break out into 4 separate groups
2) Each group will be assigned an item-difficulty value
3) Think of a question you could ask your fellow attendees that would probably have a difficulty value close to your group's assigned value. Example: if you are assigned a difficulty value of .25, what is a question you could ask that only 25% of the attendees would know? (Remember: item difficulty is the percentage of people answering the question correctly; larger values = easier, smaller values = harder)
4) Think of 4 multiple-choice options to go with your question, one of which is right
5) When you are ready, have one member of the group go up to the presenter and share the group's question and answers
6) WHEN TOLD THE SURVEY IS READY BY THE PRESENTER: complete the combined survey online (link will be provided) - skip your own question
7) Access the spreadsheet link provided, and compute the difficulty for each question
8) How close was the actual difficulty to the difficulty you were assigned?
ITEM DISCRIMINATION
ITEM DISCRIMINATION: WHAT IS IT? What is Item Discrimination? Item discrimination is how much a question relates to a person's overall knowledge on the exam / ability. High item discrimination = you either know it or you don't. Examples of a question with good discrimination: smart people know the answer to the question, and low-ability people don't; people who studied get the question right, people who didn't study get the question wrong.
ITEM DISCRIMINATION Imagine a test with THREE questions and FIVE students (1 = correct, 0 = incorrect):
Examinee Q1 Q2 Q3
Alice 1 1 1
Bob 1 1 1
Cindy 0 1 0
Dan 0 0 1
Erin 0 0 1
ITEM DISCRIMINATION Imagine a test with THREE questions and FIVE students. Alice and Bob did the best (perfect scores); Cindy, Dan, and Erin did the worst (33%):
Examinee Q1 Q2 Q3 Total
Alice 1 1 1 100%
Bob 1 1 1 100%
Cindy 0 1 0 33%
Dan 0 0 1 33%
Erin 0 0 1 33%
ITEM DISCRIMINATION Using the same table: which question's performance best predicts who will score high on the exam (Alice and Bob) and who will score low (Cindy, Dan, and Erin)?
ITEM DISCRIMINATION How people did on Question 1 predicts who will score high and who will score low on the exam: people who answered Question 1 correctly got a 100%, and people who answered Question 1 incorrectly got a 33%. Question 1 has good discrimination.
ITEM DISCRIMINATION How people did on Question 3 does not predict who will score high and who will score low on the exam: some people who answered Question 3 correctly got a 100%, but many who scored 33% also got Question 3 correct. Question 3 has worse discrimination than Question 1.
ITEM DISCRIMINATION: HOW TO CALCULATE IT How to Calculate Item Discrimination: Make one column that has whether people got the question right (1) or wrong (0) = Question Scores. Make another column that has people's total score on the exam (0 to 100%) = Total Scores. Item Discrimination = (Pearson's) correlation between Question Scores and Total Scores. Commonly called the point-biserial correlation or item-total correlation. Often corrected by removing the item's score from the total score.
ITEM DISCRIMINATION: HOW TO CALCULATE IT Item discrimination = correlation between the item's performance and the total performance:
Examinee Q1 Q2 Q3 Total
Alice 1 1 1 3
Bob 1 1 1 3
Cindy 0 1 0 1
Dan 0 0 1 1
Erin 0 0 1 1
Correlation between the Q1 column and the Total column: discrimination = 1.0. Correlation between the Q2 column and the Total column: discrimination = .67. Correlation between the Q3 column and the Total column: discrimination = .41. Larger values indicate the item can better predict overall test performance.
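These item-total correlations can be reproduced with a plain Pearson correlation. A self-contained Python sketch (the 1/0 matrix is the five-student example, reconstructed to be consistent with the totals shown on the slides; the uncorrected correlation is used to match the slide's numbers):

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Rows: Alice, Bob, Cindy, Dan, Erin; columns: Q1, Q2, Q3 (1/0 scores)
exam = [
    [1, 1, 1],
    [1, 1, 1],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
]
totals = [sum(row) for row in exam]  # 3, 3, 1, 1, 1

for q in range(3):
    item_scores = [row[q] for row in exam]
    print(round(pearson(item_scores, totals), 2))
```

The corrected item-total correlation mentioned on the previous slide would instead correlate each item with `totals` minus that item's own score.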
ITEM DISCRIMINATION: HOW TO INTERPRET IT How to Interpret Item Discrimination: Ranges from -1 to +1 (almost always positive). Larger positive values = question strongly relates to ability; smaller values = question does not relate to ability much. Ideal values are positive and high (above +.2). Positive = those who correctly answer a particular item also tend to do well on the test overall. Zero = no relationship between exam knowledge and getting the question right. Negative = the more you know, the less likely you are to get the question right. If above .2, the item is useful for describing people's overall ability.
ITEM DISCRIMINATION: DIAGNOSTICS Low item discrimination is problematic: it suggests that people who know the concepts really well overall were not any more likely to understand the specific concept in the question. If item discrimination is too low (<.2): The item may be miskeyed. The item may not represent the domain of interest. The item's concept may not be taught well. The item may be ambiguous. The item may be misleading. The item may be too easy or too difficult (everyone got the question right or wrong).
ITEM DISCRIMINATION: IMPROVEMENTS To Improve Low Item Discrimination (<.2): Check to make sure the item is keyed correctly. Check to make sure the item is conceptually relevant. Modify the instruction to explain the concept better. Make the question more specific. Ask how students interpreted the question. Ensure that difficulty is at the ideal level, given the number of response options.
ITEM DISCRIMINATION: PRECAUTIONS Precautions on using Item Discrimination - not meaningful if: Too few respondents (small sample size). Too few questions (doesn't capture the topic area). Item difficulty too low or high (no variation for the correlation). Partial credit for answers (some answers are less wrong than others).
ACTIVITY: ITEM DISCRIMINATION (EXERCISE)
1) Think to yourself about a multiple-choice exam you might give in your respective field for a specific topic
2) What kinds of questions would you ask?
3) Which of those questions, if answered correctly, would indicate that this person understands the topic well as a whole? a) This question has good discrimination. b) It can tell you who likely has high knowledge and who has lower knowledge.
4) Which of those questions, if answered correctly, doesn't necessarily indicate that this person understands the topic well? a) This question has poor discrimination. b) It cannot separate the high-knowledge from the low-knowledge students.
RELIABILITY
TEST RELIABILITY: WHAT IS IT? What is Test Reliability? Test Reliability = Consistency of Scores. People's observed exam score is a mixture of true ability and error. True ability = what you actually know about the entire topic. Error = fatigue, misreading a question, luck.
TEST RELIABILITY: WHAT IS IT? What Does Reliability Mean? If a test is 100% reliable, then the score a person receives is their true score, and they would get the same score each time they retook the exam. There was no error on the exam: luck had nothing to do with the scores, no questions were misread, you were not tired; the score is based completely on ability. If our test is reliable, then a student's ability is reflected in the score received - you are capturing pure ability instead of ability + error.
TEST RELIABILITY: WHAT IS IT? What Does Reliability Mean? If a test is not 100% reliable, then the score a person receives may be either higher or lower than their actual true score, and the next score might be different. Error played a big role in the scores: you got lucky the first time, you misread some questions, you were tired; your true ability wasn't reflected in the scores. If our test is unreliable, then a student's ability is not reflected in the score received.
TEST RELIABILITY: WHAT IS IT? Ways of Measuring Reliability: Test-retest reliability - consistency from one examination point to another. Parallel forms reliability - consistency from one exam form to another. Internal consistency reliability - consistency of items with the other items. We're going to focus on internal consistency because it is the easiest to measure and also provides highly useful information.
INTERNAL CONSISTENCY: WHAT IS IT? What is Internal Consistency? Internal consistency is how consistent the items are with the other items. High internal consistency = questions are highly correlated (address a similar topic) and there are many questions. Low internal consistency = questions are unrelated and there are few questions. Internal consistency is measured with Cronbach's Alpha (also called KR-20 when items are scored right/wrong).
INTERNAL CONSISTENCY: WHY DOES IT MATTER? Why Does Internal Consistency Matter? Cronbach's Alpha provides a lower bound on the reliability of an exam: if you know an exam's internal consistency, then you know the worst case of its reliability. Reliability = correlation of scores on the exam with scores on another equivalent exam. Good internal consistency makes it more likely that students' scores are stable.
INTERNAL CONSISTENCY: HOW TO INTERPRET IT How to Interpret Internal Consistency: Ranges from -Infinity to 1 (almost always positive). Larger positive values = test is highly reliable; smaller values = test is not very reliable. Ideal values are positive and high (above +.7). Alpha / Interpretation: >= .9 Excellent; .9 > alpha >= .8 Good; .8 > alpha >= .7 Acceptable; .7 > alpha >= .6 Marginal; < .6 Poor
INTERNAL CONSISTENCY: HOW TO CALCULATE IT How to Calculate Internal Consistency: Count the number of questions = K. For each question, make a column that has whether people got the question right (1) or wrong (0). Calculate the correlation between each right/wrong column and every other right/wrong column: if K questions, then K * (K - 1) / 2 comparisons; if 20 questions, then 20 * 19 / 2 = 190 comparisons; if 40 questions, then 40 * 39 / 2 = 780 comparisons. Calculate the average inter-item correlation = r. Apply the following formula: alpha = (K * r) / (1 + (K - 1) * r)
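The whole procedure - pairwise correlations, the average inter-item correlation, and the standardized-alpha formula - fits in a short Python sketch (the 4x3 score matrix here is hypothetical):

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def standardized_alpha(matrix):
    """K * r / (1 + (K - 1) * r), with r the average inter-item
    correlation over the K * (K - 1) / 2 item pairs."""
    k = len(matrix[0])
    cols = [[row[i] for row in matrix] for i in range(k)]
    pairs = [pearson(cols[i], cols[j])
             for i in range(k) for j in range(i + 1, k)]
    r_bar = sum(pairs) / len(pairs)
    return k * r_bar / (1 + (k - 1) * r_bar)

scores = [  # hypothetical 1/0 scores: 4 students x 3 questions
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(round(standardized_alpha(scores), 2))  # 0.75
```

Note this is the standardized form of alpha implied by the slide's formula; statistical packages usually report the covariance-based form, which can differ slightly.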
ACTIVITY: TEST RELIABILITY (EXERCISE) Internal consistency is affected by the test length and the average inter-item correlation. Test length: the more questions on the test, the more reliable the test will be. Average inter-item correlation: the more the questions address a single common domain, the more reliable the test will be - all questions pertaining to the same topic area = higher average correlation between question scores; all questions pertaining to disparate topic areas = lower average correlation between question scores.
1) Imagine we ask students to: a) Calculate the area of a hexagon b) Find the hypotenuse of a right triangle c) Find the missing angle in a triangle d) Find the radius of a circle
2) What additional question could we ask that would probably INCREASE the average inter-item correlation?
3) What additional question could we ask that would probably DECREASE the average inter-item correlation?
TEST RELIABILITY: DIAGNOSTICS If Test Reliability is too Low (<.7): The items may be miskeyed. Items represent too many distinct dimensions (too many concepts being asked). Too few items. Items are not written clearly. Items have poor difficulty and discrimination.
TEST RELIABILITY: IMPROVEMENT To Improve Low Test Reliability (<.7): Check that items are keyed correctly. Check that items are assessing a common domain. Increase the number of items (e.g., lengthening a .70-reliability exam by 50% raises its reliability to about .78). Clarify ambiguous items. Recheck item difficulty and discrimination.
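The effect of lengthening a test can be checked with the Spearman-Brown prophecy formula, a standard psychometric result (not derived on the slide):

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after the test length is multiplied
    by length_factor (Spearman-Brown prophecy formula)."""
    return (length_factor * reliability
            / (1 + (length_factor - 1) * reliability))

# A test with reliability .70 made 50% longer:
print(round(spearman_brown(0.70, 1.5), 2))  # 0.78
```

The same formula (with `length_factor` = K and `reliability` = the average inter-item correlation) is the alpha formula from the internal-consistency slide.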
TEST RELIABILITY: PRECAUTIONS Precautions on using Test Reliability: Not meaningful if too few respondents (small sample size). Cronbach's Alpha only provides a lower-bound estimate of reliability - the actual reliability could be much higher.
IMPLEMENTING ITEM ANALYSIS
IMPLEMENTING ITEM ANALYSIS You have several options for implementing item analysis: D2L, Excel, SPSS
IMPLEMENTING ITEM ANALYSIS: D2L D2L provides many of the item analysis statistics. Make a Quiz on D2L and collect responses. Go to Quizzes. Click on the dropdown menu next to the quiz name. Click Statistics. Click on Question Stats. Average Grade = Item Difficulty; Point Biserial = Discrimination; Discrimination Index = similar to Point Biserial, but not recommended.
IMPLEMENTING ITEM ANALYSIS: D2L - STEPS VISUALIZED (1) Go to the D2L course page and click on Quizzes (2) Click on the menu for a specific quiz and select Statistics (3) Click on Question Stats to view Item Analysis (4) Each item has its own analysis statistics
IMPLEMENTING ITEM ANALYSIS: D2L - THE OUTPUT Point Biserial = Item Discrimination; Average Grade = Item Difficulty
IMPLEMENTING ITEM ANALYSIS: EXCEL You can calculate item analysis with Excel. You need an Excel spreadsheet where: each row is a different student; each column is a different question; each cell is either 1 or 0, indicating whether the student got the corresponding question right or wrong. Example Spreadsheet: https://goo.gl/xccrge
IMPLEMENTING ITEM ANALYSIS: EXCEL - ENTER IN DATA Enter the data in the first sheet, in the same format shown previously in the lesson
IMPLEMENTING ITEM ANALYSIS: EXCEL - EXAM ITEM STATISTICS Each item has its own Difficulty and Discrimination score
IMPLEMENTING ITEM ANALYSIS: EXCEL - EXAM TEST STATISTICS Cronbach's alpha = Internal Consistency
IMPLEMENTING ITEM ANALYSIS: SPSS You can calculate item analysis with SPSS (or R, SAS, Minitab, Stata). Access SPSS for free via DePaul Virtual Labs. Enter the data as you would for an Excel spreadsheet: each row is a different student, each column is a different question, each cell is either 1 or 0 indicating whether the student got the corresponding question right or wrong. At the top menu, go to Analyze -> Scale -> Reliability Analysis. Move the test questions to the Items panel. Click Statistics and ask for item, scale, and scale if item deleted statistics. Click OK.
IMPLEMENTING ITEM ANALYSIS: SPSS - STEP 1 - CHOOSE ANALYSIS
IMPLEMENTING ITEM ANALYSIS: SPSS - STEP 2 - SELECT VARIABLES
IMPLEMENTING ITEM ANALYSIS: SPSS - STEP 3 - INTERPRETATION Cronbach's alpha = Internal consistency; Item Mean = Item difficulty; Corrected Item-Total Correlation = Item discrimination
SUMMARY
SUMMARY Item Analysis can provide useful information when examining multiple-choice tests: How difficult were the questions? How well does a question contribute to understanding a person's performance? How reliable are the overall test scores? It is important to consider the reasons why an item is performing poorly; items can perform poorly due to wording ambiguity, lack of ability in that domain, miscoding, lack of conceptual relevance, or instructional issues. Item Analysis is an iterative process - it takes time.
SUMMARY There are many tools for performing item analysis. Ones we discussed: D2L, Excel, SPSS. Many others available: Stata, SAS, R, PSPP, jMetrik
SUMMARY Item Analysis is a collection of tools - there are more out there: Alternative (distractor) analysis - Are the incorrect answers equally likely to be chosen? Differential Item Functioning - Are the items fair between groups of people? Factor analysis - What underlying constructs are the items measuring? Item Response Theory - What are people's abilities, when you take into account the difficulty and discrimination of the items they answered correctly/incorrectly?
Q&A