By Hui Bian, Office for Faculty Excellence

Similar documents
Making a psychometric. Dr Benjamin Cowan- Lecture 9

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments

ESTABLISHING VALIDITY AND RELIABILITY OF ACHIEVEMENT TEST IN BIOLOGY FOR STD. IX STUDENTS

Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz

Reliability and Validity checks S-005

Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories

The Psychometric Development Process of Recovery Measures and Markers: Classical Test Theory and Item Response Theory

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison

VARIABLES AND MEASUREMENT

Psychometrics for Beginners. Lawrence J. Fabrey, PhD Applied Measurement Professionals

Validity, Reliability, and Fairness in Music Testing

11-3. Learning Objectives

Importance of Good Measurement

Introduction to Reliability

Certification of Airport Security Officers Using Multiple-Choice Tests: A Pilot Study

Internal structure evidence of validity

Basic concepts and principles of classical test theory

Psychometric Instrument Development

THE TEST STATISTICS REPORT provides a synopsis of the test attributes and some important statistics. A sample is shown here to the right.

Use of the Quantitative-Methods Approach in Scientific Inquiry. Du Feng, Ph.D. Professor School of Nursing University of Nevada, Las Vegas

Reliability & Validity Dr. Sudip Chaudhuri

Associate Prof. Dr Anne Yee. Dr Mahmoud Danaee

Constructing Indices and Scales. Hsueh-Sheng Wu CFDR Workshop Series June 8, 2015

Statistics for Psychosocial Research Session 1: September 1 Bill

Designing a Questionnaire

SEMINAR ON SERVICE MARKETING

Measurement and Descriptive Statistics. Katie Rommel-Esham Education 604

Psychometric Instrument Development

Likert Scaling: A how to do it guide As quoted from

ATTITUDE SCALES. Dr. Sudip Chaudhuri. M. Sc., M. Tech., Ph.D. (Sc.) (SINP / Cal), M. Ed. Assistant Professor (Stage-3) / Reader

Chapter 3 Psychometrics: Reliability and Validity

Validity refers to the accuracy of a measure. A measurement is valid when it measures what it is suppose to measure and performs the functions that

LANGUAGE TEST RELIABILITY On defining reliability Sources of unreliability Methods of estimating reliability Standard error of measurement Factors

Evaluating the Reliability and Validity of the. Questionnaire for Situational Information: Item Analyses. Final Report

On the usefulness of the CEFR in the investigation of test versions content equivalence HULEŠOVÁ, MARTINA

Measurement of Constructs in Psychosocial Models of Health Behavior. March 26, 2012 Neil Steers, Ph.D.

Psychometric Instrument Development

Psychometric Instrument Development

ADMS Sampling Technique and Survey Studies

Research Questions and Survey Development

A Short Form of Sweeney, Hausknecht and Soutar s Cognitive Dissonance Scale

Psychological testing

Validating Measures of Self Control via Rasch Measurement. Jonathan Hasford Department of Marketing, University of Kentucky

Validity, Reliability, and Defensibility of Assessments in Veterinary Education

PLS 506 Mark T. Imperial, Ph.D. Lecture Notes: Reliability & Validity

RELIABILITY AND VALIDITY (EXTERNAL AND INTERNAL)

1. Evaluate the methodological quality of a study with the COSMIN checklist

Lecture Week 3 Quality of Measurement Instruments; Introduction SPSS

Development of a measure of self-regulated practice behavior in skilled performers

ORIGINS AND DISCUSSION OF EMERGENETICS RESEARCH

Reliability. Internal Reliability

Validity and reliability of measurements

Overview of Experimentation

Validity and reliability of measurements

The Concept of Validity


DATA is derived either through. Self-Report Observation Measurement

Manifestation Of Differences In Item-Level Characteristics In Scale-Level Measurement Invariance Tests Of Multi-Group Confirmatory Factor Analyses

Doctoral Dissertation Boot Camp Quantitative Methods Kamiar Kouzekanani, PhD January 27, The Scientific Method of Problem Solving

2016 Technical Report National Board Dental Hygiene Examination

So far. INFOWO Lecture M5 Homogeneity and Reliability. Homogeneity. Homogeneity

Introduction: Speaker. Introduction: Buros. Buros & Education. Introduction: Participants. Goal 10/5/2012

PÄIVI KARHU THE THEORY OF MEASUREMENT

EVALUATING AND IMPROVING MULTIPLE CHOICE QUESTIONS

Psychological testing

The DLOSCE: A National Standardized High-Stakes Dental Licensure Examination. Richard C. Black, D.D.S., M.S. David M. Waldschmidt, Ph.D.

Reliability AND Validity. Fact checking your instrument

Ch. 11 Measurement. Measurement

Approach to Measurement M. E. Swisher, Revised December, 2016

Oak Meadow Autonomy Survey

2 Types of psychological tests and their validity, precision and standards

Personality Traits Effects on Job Satisfaction: The Role of Goal Commitment

CHAPTER III RESEARCH METHODOLOGY

MEASUREMENT THEORY 8/15/17. Latent Variables. Measurement Theory. How do we measure things that aren t material?

Chapter 02 Developing and Evaluating Theories of Behavior

Reliability Analysis: Its Application in Clinical Practice

PTHP 7101 Research 1 Chapter Assignments

Issues That Should Not Be Overlooked in the Dominance Versus Ideal Point Controversy

Ch. 11 Measurement. Paul I-Hai Lin, Professor A Core Course for M.S. Technology Purdue University Fort Wayne Campus

Detecting Suspect Examinees: An Application of Differential Person Functioning Analysis. Russell W. Smith Susan L. Davis-Becker

Proceedings of the 2011 International Conference on Teaching, Learning and Change (c) International Association for Teaching and Learning (IATEL)

Self Report Measures

CHAPTER III RESEARCH METHOD. method the major components include: Research Design, Research Site and

Approach to Measurement M. E. Swisher, Revised December, 2017

The Concept of Validity

Selecting and Designing Instruments: Item Development, Reliability, and Validity John D. Hathcoat, PhD & Courtney B. Sanders, MS, Nikole Gregg, BA

Methodological Issues in Measuring the Development of Character

Assessing Measurement Invariance in the Attitude to Marriage Scale across East Asian Societies. Xiaowen Zhu. Xi an Jiaotong University.

26:010:557 / 26:620:557 Social Science Research Methods

Personal Style Inventory Item Revision: Confirmatory Factor Analysis

Measurement is the process of observing and recording the observations. Two important issues:

Chapter -6 Reliability and Validity of the test Test - Retest Method Rational Equivalence Method Split-Half Method

Developing and Testing Hypotheses Kuba Glazek, Ph.D. Methodology Expert National Center for Academic and Dissertation Excellence Los Angeles

Any phenomenon we decide to measure in psychology, whether it is

Survey research (Lecture 1) Summary & Conclusion. Lecture 10 Survey Research & Design in Psychology James Neill, 2015 Creative Commons Attribution 4.

Survey research (Lecture 1)

Lecture 4: Research Approaches

THE USE OF CRONBACH ALPHA RELIABILITY ESTIMATE IN RESEARCH AMONG STUDENTS IN PUBLIC UNIVERSITIES IN GHANA.

Summary & Conclusion. Lecture 10 Survey Research & Design in Psychology James Neill, 2016 Creative Commons Attribution 4.0

Transcription:

By Hui Bian, Office for Faculty Excellence

Email: bianh@ecu.edu | Phone: 328-5428 | Location: 1001 Joyner Library, room 1006 | Office hours: 8:00am-5:00pm, Monday-Friday

Educational tests and regular surveys. Higher-stakes tests, such as credentialing tests (where a passing score is involved), require systematic and thorough development methods. We focus here on lower-stakes tests and scales measuring social and psychological phenomena.

DeVellis, R. F. (2012). Scale development: Theory and applications (3rd ed.). Thousand Oaks, CA: Sage Publications, Inc. Downing, S. M., & Haladyna, T. M. (2006). Handbook of test development. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.

Fishman, J. A., & Galguera, T. (2003). Introduction to test construction in the social and behavioral sciences: A practical guide. Lanham, MD: Rowman & Littlefield Publishers, Inc. Pett, M. A., Lackey, N. R., & Sullivan, J. J. (2003). Making sense of factor analysis: The use of factor analysis for instrument development in health care research. Thousand Oaks, CA: Sage Publications, Inc.

Why develop a new survey/instrument/questionnaire? The measurement scale of interest doesn't exist, or the existing surveys are not sufficient for your research.

Goal of survey design: practically, to develop reliable, valid, and usable scales.

Relationship of theory and scales: a scale may measure theoretical/hypothetical constructs, or it may be a measure without any theoretical foundation.

Latent variables exist and influence people's behaviors, but they are intangible: self-esteem, depression, anxiety, etc. Multiple items may be needed to capture the essence of those variables.

Latent variables: we try to develop a collection of items to represent the level of an underlying theoretical variable.

Theory of Planned Behavior (TPB)

[Path diagram: a second-order measurement model in which a higher-order Construct underlies Factor1 and Factor2 (with disturbances e7 and e8); each factor is measured by three indicators (Indica1-Indica6), each indicator with its own error term (error1-error6).]

Two measurement theories: Classical Test Theory (CTT) vs. Item Response Theory (IRT). CTT: observed score = true score + error; scale statistics are linked to the people being measured. IRT: more focus on item performance; item parameters are not tied to a particular sample.

CTT: reliability is enhanced by more items or better items. IRT: reliability is improved by better items; it is also concerned with how strongly each item relates to the latent variable.

Scale reliability: the consistency or stability of the scores produced by the survey over time.

Reliability, in other words: scale reliability is the proportion of observed-score variance attributable to the true score of the latent variable. Reliability = true-score variance / observed-score variance.
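To make that ratio concrete, here is a minimal simulation sketch in Python; the sample size, means, and standard deviations are hypothetical, chosen only for illustration:

```python
# Minimal sketch of the CTT identity: reliability = true-score variance / observed variance.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true_scores = rng.normal(50, 10, n)       # latent true scores (sd = 10)
errors = rng.normal(0, 5, n)              # random measurement error (sd = 5)
observed = true_scores + errors           # CTT: observed = true + error

reliability = true_scores.var() / observed.var()
print(round(reliability, 3))              # close to 100 / (100 + 25) = 0.8
```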

CTT: observed score = true score + error. Measurement error: the more error, the less reliable the scale. Systematic error consistently recurs on repeated measures with the same instrument; it reflects problems with the underlying construct (measuring a different construct, which affects validity).

Random error is inconsistent and not predictable; sources include environmental factors and variations in administration.

Internal consistency: the homogeneity of items within a scale. Items share a common cause (the latent variable); higher inter-item correlations suggest that the items are all measuring the same thing.

Measures of internal consistency: Cronbach's alpha (between 0 and 1); the Kuder-Richardson formula 20 (KR-20) for dichotomous items. Reliability analysis using SPSS (Cronbach's alpha): the data can be dichotomous, ordinal, or interval, but they should be coded numerically.

Cronbach's alpha: a high value of alpha is often used as evidence that the items measure an underlying (or latent) construct. However, a high alpha does not imply that the measure is unidimensional.
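As a sketch of how the alpha statistic is computed (the same quantity SPSS's Reliability Analysis reports; with 0/1 items this formula reduces to KR-20), assuming a respondents-by-items score matrix; the data here are simulated for illustration:

```python
# Cronbach's alpha from a respondents-by-items matrix; a minimal sketch.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array, rows = respondents, columns = items (numeric codes)."""
    k = items.shape[1]
    sum_item_vars = items.var(axis=0, ddof=1).sum()  # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)        # variance of total scores
    return (k / (k - 1)) * (1 - sum_item_vars / total_var)

# Hypothetical 5-item Likert data driven by one latent variable:
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 1))
data = np.clip(np.round(3 + latent + rng.normal(0, 1, (200, 5))), 1, 5)
print(round(cronbach_alpha(data), 2))
```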

Alternate-forms reliability: have the same group of people complete two separate versions of a scale, then correlate the scores from the two administrations.

Split-half reliability: compare the first half of the items with the second half, or compare the odd-numbered items with the even-numbered items.
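A minimal sketch of the odd/even split; the slides do not mention it, but the half-test correlation is conventionally stepped up with the Spearman-Brown formula to estimate full-length reliability:

```python
# Odd/even split-half reliability with the Spearman-Brown step-up; a sketch.
import numpy as np

def split_half_reliability(items: np.ndarray) -> float:
    odd_half = items[:, 0::2].sum(axis=1)   # items 1, 3, 5, ...
    even_half = items[:, 1::2].sum(axis=1)  # items 2, 4, 6, ...
    r = np.corrcoef(odd_half, even_half)[0, 1]
    return 2 * r / (1 + r)                  # Spearman-Brown correction

# Demo on simulated 6-item data:
rng = np.random.default_rng(2)
latent = rng.normal(size=(200, 1))
data = 3 + latent + rng.normal(0, 1, (200, 6))
print(round(split_half_reliability(data), 2))
```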

Test-retest reliability (temporal stability): give the same set of items to subjects on two separate occasions, then correlate the scores from the first occasion with those from the second.
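Test-retest (and likewise alternate-forms) reliability is simply the correlation between the two administrations; a sketch with hypothetical scores:

```python
# Test-retest reliability: correlate scores from two occasions; a sketch.
import numpy as np

occasion1 = np.array([12, 15, 9, 20, 18, 11, 14])   # hypothetical totals, time 1
occasion2 = np.array([13, 14, 10, 19, 17, 12, 15])  # same subjects, time 2
print(round(float(np.corrcoef(occasion1, occasion2)[0, 1]), 2))
```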

Rule of thumb for strength of correlation (Pett, Lackey, & Sullivan, 2003):
.00-.29 weak
.30-.49 low
.50-.69 moderate
.70-.89 strong
.90-1.00 very strong

Rule of thumb for Cronbach's alpha (DeVellis, 2012):
Below .60: unacceptable
Between .60 and .65: undesirable
Between .65 and .70: minimally acceptable
Between .70 and .80: respectable
Between .80 and .90: very good
Above .90: consider shortening the scale

How to control reliability

Validity is the most fundamental consideration in survey design. Validity: the scale truly measures what it is supposed to measure. The soundness of research outcomes depends on the soundness of the measures.

Handbook of test development: Effective test development requires a systematic, well-organized approach to ensure sufficient validity evidence to support the proposed inferences from the test scores.

Types of validity (DeVellis, 2012): content validity, criterion-related validity, construct validity.

Standards for Educational and Psychological Testing (1999): American Educational Research Association (AERA), American Psychological Association (APA), National Council on Measurement in Education (NCME).

Evidence based on test content: test content refers to the themes, wording, and format of the items, and the guidelines for administration procedures.

Evidence based on response processes: consider the target subjects. For example, does the format favor one subgroup over another? In other words, something irrelevant to the construct may be differentially influencing the performance of different subgroups.

Evidence based on internal structure: the degree to which the relationships among instrument items and components conform to the construct on which the proposed relationships are based (confirmatory factor analysis). Evidence based on relationships to other variables: relationships of test scores to variables external to the test.

Definition of the construct being examined; adequacy of the content; relevance of the content. Concerns: how do we know the selected items are representative, and how do we know the items capture all aspects of the construct?

It is critical to establish accurate and comprehensive content for an instrument. Selection of content should be based on sound theories and empirical evidence or previous research. A content analysis is recommended: it is the process of analyzing the structure and content of the instrument, in two stages (a development stage and an appraisal stage).

Instrument specification:
Content of the instrument
Number of items
Item formats
Desired psychometric properties of the items
Item and section arrangement (layout)
Time to complete the survey
Directions to the subjects
Procedure for administering the survey

Content evaluation (Guion, 1977):
The content domain must have a generally accepted meaning.
The content domain must be defined unambiguously.
The content domain must be relevant to the purpose of measurement.
Qualified judges must agree that the domain has been adequately sampled.
The response content must be reliably observed and evaluated.

Guion, R. (1977). Content validity: The source of my discontent. Applied Psychological Measurement, 1, 1-10.

Content evaluation: clarity of statements, relevance, coherence, representativeness.

A scale is required to have an empirical association with a criterion or gold standard: collect data from the newly developed instrument and from the criterion.

Evidence based on relationships to other variables: external variables can be measures of criteria, or other tests measuring the same, related, or different constructs.

Convergent and discriminant evidence. Convergent: the relationship of test scores to measures of similar constructs. Discriminant: the relationship of test scores to measures of different constructs.

To demonstrate construct validity, we should provide evidence that the scale measures what it is supposed to measure. Construct validation requires the compilation of multiple sources of evidence: content validity, item performance, and criterion-related validity.

Validity studies should address both the internal structure of the test and the external relations of the test to other variables. Internal structure: subdomains or subconstructs. External relations: relationships between test measures and other constructs or variables.

Construct-irrelevant variance: the degree to which test scores are affected by processes that are extraneous to the construct.

Construct-irrelevant variance is systematic error; it may increase or decrease test scores. y = t + e1 + e2, where y is the observed score, t is the true score, e1 is random error (which affects reliability), and e2 is systematic error (which affects validity).
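Assuming the three components are mutually independent, a short variance decomposition (standard CTT reasoning, not spelled out on the slide) shows why e1 lowers reliability while e2 lowers validity: systematic error recurs on every administration, so it counts toward the consistent variance but not toward the valid variance.

```latex
\operatorname{Var}(y) = \operatorname{Var}(t) + \operatorname{Var}(e_1) + \operatorname{Var}(e_2),
\qquad
\text{reliability} = \frac{\operatorname{Var}(t) + \operatorname{Var}(e_2)}{\operatorname{Var}(y)},
\qquad
\text{valid variance ratio} = \frac{\operatorname{Var}(t)}{\operatorname{Var}(y)}.
```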

Construct underrepresentation: the degree to which a test fails to capture important aspects of the construct. It concerns fidelity: whether all dimensions of the studied content are covered.

Validation is the process of developing a valid instrument and assembling validity evidence to support the claim that the instrument is valid. Validation is an ongoing process, and validity evolves during this process.

12 steps of test development (steps 1-6):
1. Overall plan
2. Content definition
3. Test specifications
4. Item development
5. Test design and assembly
6. Test production

12 steps of test development (steps 7-12):
7. Test administration
8. Scoring test responses
9. Passing scores
10. Reporting test results
11. Item banking
12. Test technical report

Nine steps of scale development (DeVellis, 2012; Fishman & Galguera, 2003; Pett, Lackey, & Sullivan, 2003):
Step 1: Determine what you want to measure
Step 2: Generate an item pool
Step 3: Determine the format for items
Step 4: Expert review of the initial item pool
Step 5: Add social desirability items
Step 6: Pilot testing and item analysis
Step 7: Administer the instrument to a larger sample
Step 8: Evaluate the items
Step 9: Revise the instrument

What will the instrument measure? Will the instrument measure the construct broadly or specifically (for example, self-efficacy in general vs. self-efficacy for avoiding drinking)? Do all the items tap the same construct or different ones? Use sound theories as a guide. This step relates to content validity.

Generating the item pool is also related to content validity. Choose items that truly reflect the underlying construct. You can borrow or modify items from existing instruments (which are already valid and reliable). Build in redundancy: include more items at this stage than will appear in the final scale; a 10-item scale might evolve from a 40-item pool.

Writing new items (e.g., with input from a focus group):
Wording should be clear and inoffensive.
Avoid lengthy items.
Consider the reading difficulty level.
Avoid items that convey two or more ideas.
Be careful with positively and negatively worded items.

Items consist of two parts: a stem and a series of response options. On the number of response options: a scale should discriminate differences in the underlying attribute, within the respondents' ability to discriminate meaningfully between options.

Number of response options and equivocation: whether to include "neutral" as a response option. Types of response format: Likert scale, binary options, selected-response (multiple-choice) format.

Components of the instrument:
Format (font, font size)
Layout (how many pages)
Instructions to the subjects
Wording of the items
Response options
Number of items

The purpose of expert review is to maximize content validity. The panel of experts consists of people who are knowledgeable in the content area. Item evaluation: How relevant is each item to what you intend to measure? Are the items clear and concise? Is any content missing? The final decision to accept or reject expert recommendations is the developer's responsibility.

Social desirability is the tendency of subjects to respond to test items in a way that presents themselves in socially acceptable terms, in order to gain the approval of others. Individual items can be influenced by social desirability; a 10-item measure by Strahan and Gerbasi (1972) can be used to detect it.

Do the selected items cover the content completely? How many items should there be? How many subjects do we need to pilot test the instrument?

Pilot testing (also called a tryout or field test). Purposes:
Identify, remove, or revise bad items.
Surface any problems with item content, formats, or the data collection procedure.
Prevent researchers from spending valuable resources on a study that uses an instrument that is not valid or reliable.
Determine the amount of time needed to complete the instrument.

Sample size: about one tenth the size of the sample for the main study. People who participate in the pilot test cannot be in the final study.

Item analysis is about item performance: reliability and validity concerns at the item level. It serves as a means of detecting flawed items and helps select items to be included in the test or identify items that need revision. Item selection should consider content, process, and item format in addition to item statistics.

Item response theory (IRT) focuses on individual items and their characteristics. Reliability is enhanced not by redundancy but by identifying better items. IRT items are designed to tap different degrees or levels of the attribute. The goal of IRT is to establish item characteristics independent of who completes the items.

IRT concentrates on three aspects of an item's performance: item difficulty (how hard the item is), item discrimination (its capacity to discriminate; a less discriminating item has a larger region of ambiguity), and guessing (false positives).

Knowing the difficulty of the items helps avoid making a test too hard or too easy. A roughly normal distribution of difficulty is optimal. For a dichotomous item (correct/wrong), difficulty here is the rate of wrong answers: if 90 students out of 100 answer correctly, item difficulty = 10%.
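A sketch of that wrong-answer-rate computation on a hypothetical 0/1 scored matrix (note that many texts instead report the proportion correct, the item "p-value"):

```python
# Item difficulty as the rate of wrong answers, per the slides' definition; a sketch.
import numpy as np

# Hypothetical scored responses: rows = students, columns = items; 1 = correct.
rng = np.random.default_rng(3)
scored = (rng.random((100, 4)) < [0.9, 0.7, 0.5, 0.3]).astype(int)

difficulty = 1 - scored.mean(axis=0)   # wrong-answer rate per item
print(difficulty.round(2))             # roughly [0.10, 0.30, 0.50, 0.70]
```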

Item means on a four-point scale (1 = Strongly agree, 2 = Agree, 3 = Disagree, 4 = Strongly disagree; lower mean = less difficult, toward Strongly agree; higher mean = more difficult, toward Strongly disagree):

Item     Mean
a59_9    2.04
a59_10   1.77
a59_11   1.93
a59_12   1.95
a59_13   1.60
a59_14   1.58
a59_15   1.61
a59_16   1.87
a59_17   2.75
a59_30   1.63

[Figure: distribution of item difficulty (item means from 1.6 to 2.8) on the four-point scale]

Inter-item consistency: scale reliability if an item is deleted. If deleting an item increases overall reliability, that item is a poor item. This statistic can be obtained from Reliability Analysis in SPSS.
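A sketch of "alpha if item deleted", reusing the cronbach_alpha() helper and the hypothetical data matrix from the alpha sketch above:

```python
# "Alpha if item deleted": recompute alpha with each item removed in turn.
# Reuses cronbach_alpha() and data from the earlier sketch.
import numpy as np

for j in range(data.shape[1]):
    reduced = np.delete(data, j, axis=1)   # drop item j
    print(f"item {j}: alpha if deleted = {cronbach_alpha(reduced):.2f}")
```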


Item validity: the correlation of each item's response with the total test score minus the score for the item in question, i.e., the corrected item-total correlation.

A higher value of the corrected item-total correlation is desired.

Item validity: the corrected item-total correlations should form a bell-shaped distribution with a mean as high as possible. A higher correlation for an item means that people with higher total scores also tend to get higher scores on that item. Items with low correlations need further examination.
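A sketch of the corrected item-total correlation, with each item's own score removed from the total before correlating (the data are simulated for illustration):

```python
# Corrected item-total correlations; a minimal sketch.
import numpy as np

def corrected_item_total(items: np.ndarray) -> np.ndarray:
    total = items.sum(axis=1)
    return np.array([
        np.corrcoef(items[:, j], total - items[:, j])[0, 1]  # exclude item j's own score
        for j in range(items.shape[1])
    ])

rng = np.random.default_rng(4)
latent = rng.normal(size=(200, 1))
data = 3 + latent + rng.normal(0, 1, (200, 5))
print(corrected_item_total(data).round(2))
```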

Sample size for the larger administration: there are no golden rules. Guidelines include 10-15 subjects per item, or 300 cases as adequate:
50 very poor
100 poor
200 fair
300 good
500 very good
1000 or more excellent

Administration threats to validity: construct underrepresentation and construct-irrelevant variance. Efforts to avoid these threats include standardization and administrator training.

Item analysis examines item performance. Factor analysis serves two purposes: to determine how many latent variables underlie a set of items, and to label the identified factors so as to understand the meaning of the underlying latent variables.

Factor analysis: exploratory factor analysis explores the structure of a construct; confirmatory factor analysis confirms the structure obtained from exploratory factor analysis.
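A small exploratory-factor-analysis sketch using scikit-learn's FactorAnalysis; the two-factor data and loading values are simulated, arbitrary illustrations:

```python
# Exploratory factor analysis on simulated two-factor data; a minimal sketch.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(5)
latent = rng.normal(size=(300, 2))                    # two latent variables
loadings = np.array([[0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
                     [0.0, 0.8], [0.0, 0.7], [0.0, 0.6]])
items = latent @ loadings.T + rng.normal(0, 0.5, (300, 6))

fa = FactorAnalysis(n_components=2).fit(items)
print(fa.components_.round(2))   # estimated loadings: items split into two factors
```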

Effects of dropping items: reliability, construct underrepresentation, construct-irrelevant variance.
