Measurement Reliability & Types of Tests


Measurement Reliability -- Overview
Objective & Subjective tests; Standardization & Inter-rater reliability; Properties of a good item; Item Analysis; Internal Reliability (Spearman-Brown Prophecy Formula -- α & # of items); External Reliability (Test-retest Reliability & Alternate Forms Reliability).

Types of Tests
Most self-report tests used in Psychology and Education are "objective" tests: the term objective refers to the question format and scoring process. The most commonly used objective format is Likert-type items -- a statement is provided and the respondent is asked to mark one of the following:

    Strongly Disagree    Disagree    Neither Agree nor Disagree    Agree    Strongly Agree

Different numbers of selections and verbal anchors are used (and don't forget about reverse-keying). This item format is considered objective because scoring it requires no interpretation of the respondent's answer.

Subjective tests, on the other hand, require some interpretation or rating to score the response. Probably the most common type is called content coding: respondents are asked to write an answer to a question (usually a sentence to a short paragraph). The rater then either counts the number of times each specified category of word or idea occurs, and these counts become the score for the question (e.g., counting the attributes offered and sought in the archival exercise in 350 Lab), or assigns all or part of the response to a particular category (e.g., determining whether the author was seeking to contact a male or female).
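Since reverse-keying comes up whenever a Likert item is worded in the opposite direction of the construct, here is a minimal sketch of how a reverse-keyed 5-point item can be rescored before the items are summed (the function name and data are illustrative, not part of the original slides):

```python
def reverse_key(response, n_points=5):
    """Flip a Likert response so that a higher score always means 'more of the construct'."""
    return (n_points + 1) - response

# illustrative: item i3 is worded in the reverse direction
responses = {"i1": 4, "i2": 5, "i3": 2, "i4": 4}
responses["i3"] = reverse_key(responses["i3"])   # 2 -> 4 on a 5-point scale
total = sum(responses.values())
print(total)
```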

Inter-rater Reliability
We would need to assess the inter-rater reliability of the scores from these subjective items: have two or more raters score the same set of tests (usually 25-50% of the tests), then assess the consistency of the scores given by the different raters using Cohen's Kappa -- a common criterion is .70 or above.
Ways to improve inter-rater reliability: improved standardization of the measurement instrument; instruction in the elements of the standardization; practice with the instrument -- with feedback; experience with the instrument; use of the instrument with the intended population/application.

Properties of a Good Item
Each item must reflect the construct/attribute of interest -- content validity is assured, not assessed. Each item should also be positively related to the construct/attribute of interest ("positively monotonic").
[Figure: scatterplot of each person's score on the item (lower to higher item responses) against the construct of interest (lower to higher values), illustrating perfect, great, common, and bad items.]
But there's a problem: we don't have scores on the construct/attribute. So what do we do??? Use our best approximation of each person's construct/attribute score -- which is their composite score on the set of items written to reflect that construct/attribute. Yep -- we use the set of untested items to make decisions about how good each of the items is. But how can this work??? We'll use an iterative process -- not a detailed analysis, just looking for really bad items.
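As a rough illustration of the inter-rater check described above, here is a minimal sketch of Cohen's Kappa for two raters coding the same set of responses (the rater data and category labels are hypothetical):

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's Kappa for two raters assigning categories to the same responses."""
    assert len(rater1) == len(rater2)
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n   # observed agreement
    c1, c2 = Counter(rater1), Counter(rater2)
    # chance agreement: product of the two raters' marginal proportions, summed over categories
    expected = sum((c1[cat] / n) * (c2[cat] / n) for cat in set(rater1) | set(rater2))
    return (observed - expected) / (1 - expected)

# e.g., two raters coding whether each ad sought to contact a "male" or "female"
r1 = ["male", "female", "female", "male", "female", "male"]
r2 = ["male", "female", "male",   "male", "female", "male"]
print(round(cohens_kappa(r1, r2), 3))
```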

Process for Item Analysis
1st Pass: compute a total score from the set of items written to reflect the specific construct/attribute; recode the total score into five ordered categories (dividing the sample into five groups, low to high total scores); for each item, plot the means of the five groups on the item; look for items that are flat, quadratic, or backward; drop the bad items -- but don't get carried away -- keep all you can.
2nd Pass: compute a new total from the items you kept; re-recode the new total score into 5 categories; replot all the items (including the ones dropped on the 1st pass).
Additional Passes: repeat until stable.

Internal Reliability
The question of internal reliability is whether or not the set of items "hangs together" or reflects a central construct. If each item reflects the same central construct, then the aggregate (sum or average) of those items ought to provide a useful score on that construct.

Ways of Assessing Internal Reliability
Split-half reliability: the items are randomly divided into two half-tests and the scores of the two half-tests are correlated; high correlations (.7 and higher) are taken as evidence that the items reflect a central construct. Split-half reliability was easily done by hand (before computers), but it has been replaced by...
Cronbach's α -- a measure of the consistency with which the individual items inter-relate to each other:

    α = [i / (i - 1)] × [(R - i) / R]

where i = the number of items and R = the sum of all the elements of the inter-item correlation matrix (equivalently, α = (i × R̄) / (1 + (i - 1) × R̄), where R̄ is the average inter-item correlation).

From this formula you can see two ways to increase the internal consistency of a set of items: increase the similarity of the items (which will increase their average correlation R̄), or increase the number of items. α values range from 0 to 1.00 (larger is better); good α values are .6-.7 and above.
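A minimal sketch of the α computation above, assuming the item responses are available as a respondents × items array (NumPy is used for the inter-item correlation matrix; the example scores are made up):

```python
import numpy as np

def standardized_alpha(data):
    """Standardized Cronbach's alpha from a respondents x items array of scores."""
    data = np.asarray(data, dtype=float)
    i = data.shape[1]                       # number of items
    corr = np.corrcoef(data, rowvar=False)  # i x i inter-item correlation matrix
    R = corr.sum()                          # sum of all elements (diagonal + off-diagonal)
    return (i / (i - 1)) * ((R - i) / R)

# illustrative data: 6 respondents x 4 items
scores = [[4, 5, 4, 5],
          [2, 1, 2, 2],
          [3, 3, 4, 3],
          [5, 4, 5, 4],
          [1, 2, 1, 2],
          [3, 4, 3, 3]]
print(round(standardized_alpha(scores), 3))
```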

Assessing α using SPSS -- Pass #1

Item    corrected item-total r    alpha if item deleted
i1        .1454                     .63
i2        .2002                     .58
i3       -.2133                     .71
i4        .1882                     .59
i5        .1332                     .62
i6        .2112                     .56
i7        .1221                     .60
Coefficient Alpha = .58

The corrected item-total correlation is the correlation between each item and a total comprised of all the other items (except that one); negative item-total correlations indicate either a very poor item or a reverse-keying problem. "Alpha if item deleted" is what α would be if that item were dropped: drop items whose alpha-if-deleted is larger than the scale's alpha, but don't drop too many at a time!! Coefficient Alpha is the α for this set of items. It is usually better to do several passes than to drop several items at once.
On Pass #1, all items with negative item-total correlations are bad: check that they have been keyed correctly; if they have been correctly keyed, drop them (here, i3). Notice that this is very similar to doing an item analysis and looking for items that lack a positive monotonic trend.

Assessing α using SPSS -- Pass #2, etc.

Item    corrected item-total r    alpha if item deleted
i1        .1612                     .74
i2        .2202                     .68
i4        .1822                     .70
i5        .1677                     .74
i6        .2343                     .64
i7        .1121                     .76
Coefficient Alpha = .71

Check that there are now no items with negative item-total correlations. Then look for items with alpha-if-deleted values that are substantially higher than the scale's alpha value, but don't drop too many at a time: probably drop i7; probably don't drop i1 & i5 yet -- recheck them on the next pass. It is better to drop 1-2 items on each of several passes.
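To mirror the SPSS item-total summary above, here is a minimal sketch computing each item's corrected item-total correlation and its alpha-if-item-deleted value (the compact alpha helper is repeated so the sketch stands alone; item names and data are illustrative):

```python
import numpy as np

def standardized_alpha(data):
    """Standardized alpha from a respondents x items array."""
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    i, R = corr.shape[0], corr.sum()
    return (i / (i - 1)) * ((R - i) / R)

def item_diagnostics(data):
    """Corrected item-total r and alpha-if-item-deleted for each item."""
    data = np.asarray(data, dtype=float)
    results = []
    for j in range(data.shape[1]):
        rest = np.delete(data, j, axis=1)                         # all items except item j
        r_itc = np.corrcoef(data[:, j], rest.sum(axis=1))[0, 1]   # corrected item-total r
        results.append((f"i{j + 1}", r_itc, standardized_alpha(rest)))
    return results

scores = [[4, 5, 4, 5], [2, 1, 2, 2], [3, 3, 4, 3],
          [5, 4, 5, 4], [1, 2, 1, 2], [3, 4, 3, 3]]
for name, r, a in item_diagnostics(scores):
    print(f"{name}: corrected item-total r = {r:.3f}, alpha if deleted = {a:.3f}")
```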

Whenever we've considered research designs and statistical conclusions, we've always been concerned with sample size. We know that larger samples (more participants) lead to more reliable estimates of the mean, std, r, F & X², and to more reliable statistical conclusions (quantified as fewer Type I and Type II errors). The same principle applies to scale construction -- more is better -- but now it applies to the number of items comprising the scale. More (good) items lead to a better scale: they more adequately represent the content/construct domain and they provide a more consistent total score (a respondent can change more items before the total is changed much).

In fact, there is a formulaic relationship between the number of items and α (how we quantify scale reliability): the Spearman-Brown Prophecy Formula. Here are the two most common forms of the formula, where α_X = the reliability of the test/scale, α_K = the desired reliability, and k = the factor by which you must lengthen the test to obtain α_K.

Starting with the reliability of the scale (α_X) and the desired reliability (α_K), estimate by what factor you must lengthen the test to obtain the desired reliability (k):

    k = [α_K × (1 - α_X)] / [α_X × (1 - α_K)]

Starting with the reliability of the scale (α_X), estimate the resulting reliability (α_K) if the test length were increased by a certain factor (k):

    α_K = (k × α_X) / (1 + (k - 1) × α_X)

Example -- You have a 20-item scale with α_X = .50. How many items would need to be added to increase the scale reliability to .70?

    k = [α_K × (1 - α_X)] / [α_X × (1 - α_K)] = [.70 × (1 - .50)] / [.50 × (1 - .70)] = 2.33

k is a multiplicative factor -- NOT the number of items to add. To reach α_K we will need 20 × k = 20 × 2.33 = 46.6 ≈ 47 items, so we must add 27 new items to the existing 20 items.

Please Note: This use of the formula assumes that the items to be added are as good as the items already in the scale (i.e., have the same average inter-item correlation R̄). This is unlikely!! You wrote items, discarded the poorer ones during the item analysis, and now need to write still more that are as good as the best you've got???
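Both forms of the formula are easy to check numerically; here is a minimal sketch (the function names are mine, and the numbers reproduce the worked example above):

```python
def lengthening_factor(alpha_x, alpha_k):
    """Spearman-Brown: factor k by which the test must be lengthened
    to raise its reliability from alpha_x to alpha_k."""
    return (alpha_k * (1 - alpha_x)) / (alpha_x * (1 - alpha_k))

def prophesied_alpha(alpha_x, k):
    """Spearman-Brown: predicted reliability after lengthening the test by factor k."""
    return (k * alpha_x) / (1 + (k - 1) * alpha_x)

# the worked example from the slides: 20 items, alpha = .50, target alpha = .70
k = lengthening_factor(0.50, 0.70)
print(round(k, 2))        # 2.33
print(round(20 * k))      # ~47 items total, so ~27 new items
```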

Example -- You have a 20-item scale with α_X = .50. To what would the reliability increase if we added 30 items?

    k = (# original + # new) / # original = (20 + 30) / 20 = 2.5

    α_K = (k × α_X) / (1 + (k - 1) × α_X) = (2.5 × .50) / (1 + (2.5 - 1) × .50) = .71

Please Note: This use of the formula again assumes that the items to be added are as good as the items already in the scale (i.e., have the same average inter-item correlation R̄). This is unlikely!! You wrote items, discarded the poorer ones during the item analysis, and now need to write still more that are as good as the best you've got??? So this is probably an overestimate of the resulting α if we were to add 30 items.

External Reliability
Test-Retest Reliability: consistency of scores across re-testing. The test-retest interval is usually 2 weeks to 6 months. It is assessed using a combination of a correlation and a within-groups t-test. The key to assessing test-retest reliability is to recognize that we depend upon tests to give us the right score for each person -- and the score can't be right if it isn't consistent (the same score each time). For years, assessment of test-retest reliability was limited to correlational analysis (r > .70 is "good"), but consider the slides below.
Alternate Forms Reliability: sometimes it is useful to have two versions of a test -- called alternate forms -- for example, when the test is used for any type of before vs. after evaluation, where alternate forms can minimize sensitization and reactivity. Alternate forms reliability is assessed similarly to test-retest reliability: the two forms are administered (usually at the same time) and we want respondents to get the same score from both versions. Good alternate forms will have a high correlation and a small (nonsignificant) mean difference.
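The second example can be checked the same way, as a small self-contained sketch (the helper repeats the second Spearman-Brown form from the sketch above):

```python
def prophesied_alpha(alpha_x, k):
    """Spearman-Brown: predicted reliability after lengthening the test by factor k."""
    return (k * alpha_x) / (1 + (k - 1) * alpha_x)

k = (20 + 30) / 20                            # lengthening factor = 2.5
print(round(prophesied_alpha(0.50, k), 2))    # 0.71, matching the hand calculation
```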

External Reliability
You can gain substantial information by giving a test-retest of the alternate forms.
[Figure: a 4 × 4 correlation layout among Form A and Form B at Time 1 and Time 2 (Fa-t1, Fb-t1, Fa-t2, Fb-t2), marking which correlations are alternate-forms evaluations (same time, different forms), which are test-retest evaluations (same form, different times), and which are mixed evaluations (different form and different time).]
Usually you find that AltF > Test-Retest > Mixed. Why?

Evaluating External Reliability
The key to assessing test-retest reliability is to recognize that we must assess what we want the measure to tell us.
Sometimes we primarily want the measure to line up the respondents, so we can compare this ordering with how they line up on some other attribute (this is what we are doing with most correlational research). If so, then a reliable measure is one that lines up the respondents the same way each time; assess this by simply correlating the test-retest or alternate-forms scores.
Other times we are counting on the actual score to be the same across time or forms. If so, even r = 1.00 is not sufficient (the means could still differ). Similar scores are demonstrated by a combination of a good correlation (similar rank orders) and no mean difference (a similar center to the rankings).

[Figure: scatterplot of test scores (x-axis) vs. retest scores (y-axis), with r = .80 and t = 3.2, p < .05.]
What's good about this result? A good test-retest correlation. What's bad about this result? A substantial mean difference -- folks tended to have retest scores lower than their test scores.

Here's another:
[Figure: scatterplot of test vs. retest scores, with r = .30 and t = 1.2, p > .05.]
What's good about this result? Good mean agreement! What's bad about this result? A poor test-retest correlation!

And here's another:
[Figure: scatterplot of test vs. retest scores, with r = .80 and t = 1.2, p > .05.]
What's good about this result? Good mean agreement and a good correlation! What's bad about this result? Not much!
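To tie the scatterplot examples back to the "correlation plus mean difference" criterion, here is a minimal sketch of both checks using scipy.stats (the test/retest scores are invented for illustration):

```python
import numpy as np
from scipy import stats

# hypothetical test and retest scores for the same 10 respondents
test   = np.array([34, 28, 45, 22, 39, 31, 27, 41, 36, 25])
retest = np.array([33, 29, 44, 23, 38, 30, 28, 42, 35, 24])

r, _ = stats.pearsonr(test, retest)     # do respondents line up the same way twice?
t, p = stats.ttest_rel(test, retest)    # did the mean score shift between occasions?

print(f"r = {r:.2f}")                   # want a high correlation (e.g., > .70)
print(f"t = {t:.2f}, p = {p:.3f}")      # want a small, nonsignificant mean difference
```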