How Good is the Instrument? Dr Dean McKenzie


How good is the instrument?
Dr Dean McKenzie, BA(Hons) (Psychology), PhD (Psych Epidemiology), Senior Research Fellow
(Abridged version; full version to be presented July 2014)

Goals
To briefly summarize the basic tools and concepts needed to answer the question: is this instrument any good? The examples are mainly psychological instruments (my background), but the concepts also apply to instruments from other health areas.

Psycho-statistical inventions
- Cohen's effect size / kappa / power analysis
- Cronbach's alpha (internal consistency)
- Factor analysis (1904)
- Item response theory, Rasch modelling, partial credit
- Likert scale, semantic differential
- McNemar's test
- Spearman correlation
- Test reliability
- Types of data: nominal, ordinal, interval, ratio (Stanley Smith Stevens, Science 1946, 103: 677-680)
(List partly based on Cohen et al, 2013)

Turning numbers into information: Florence Nightingale
First female member of the Royal Statistical Society; pioneered the coxcomb, or polar area diagram.
Rehmeyer J (2008) Florence Nightingale: the passionate statistician. ScienceNews, 26 November 2008.

A typical instrument: the Kessler Psychological Distress Scale (K10)
- Developed in the US by Ronald Kessler et al; tested in Australia; employed in the Victorian Population Health Survey, Australian National Health Surveys, etc.
- 10 questions such as "in the past 4 weeks, about how often did you feel worthless?" and "in the past 4 weeks, about how often did you feel nervous?"
- Similar in scope to David Goldberg's GHQ (General Health Questionnaire, 1972), but the K10 is public domain (the GHQ is licensed by NFER-Nelson)
Andrews G, Slade T (2001) Australian and New Zealand Journal of Public Health, 25, 6: 494-497. http://www.blackdoginstitute.org.au/docs/5.k10withinstructions.pdf

Determining instrument/test quality
Is this instrument any good?
- Reliability: does it give consistent results across different raters and situations?
- Validity: does it measure what it is supposed to?
In short: does the instrument measure what it is supposed to, consistently and well, in the population(s) that it is applied to, here and now?

Test reliability = repeatability or consistency
- Test-retest: should achieve the same scores each time (if testing characteristics that are stable across time, such as intelligence or personality, rather than less stable characteristics such as mood or pain levels; but even then, stability may be an unrealistic assumption, Cohen et al 2013)
- Inter-rater: different interviewers/raters should reach the same conclusions/score levels

Measuring test reliability
- Measured by the intraclass* correlation, which should be 0.70 or above (1.0 is the maximum)
- e.g. if rater A's scores are 1, 3, 5 and rater B's scores are 2, 4, 6, the (Karl) Pearson correlation would be 1.0 (perfect), but the raters never actually assign the same score!
- The intraclass correlation is often used for reliability, and (Jacob) Cohen's kappa* (chance-corrected) for binary/categorical measures
(*please see Streiner & Norman 2008 for more details)
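The rater example above can be checked numerically. The sketch below is a minimal illustration (not from the talk): the `icc_2_1` helper is my own implementation of the standard two-way, absolute-agreement, single-rater intraclass correlation, ICC(2,1), applied to rater A = 1, 3, 5 and rater B = 2, 4, 6.

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    ratings: (n subjects) x (k raters) array."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-subject means
    col_means = ratings.mean(axis=0)   # per-rater means
    # Two-way ANOVA sums of squares
    ss_rows = k * np.sum((row_means - grand) ** 2)   # between subjects
    ss_cols = n * np.sum((col_means - grand) ** 2)   # between raters
    ss_total = np.sum((ratings - grand) ** 2)
    ss_err = ss_total - ss_rows - ss_cols            # residual
    ms_r = ss_rows / (n - 1)
    ms_c = ss_cols / (k - 1)
    ms_e = ss_err / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

a = [1, 3, 5]   # rater A's scores
b = [2, 4, 6]   # rater B's scores
pearson = np.corrcoef(a, b)[0, 1]
icc = icc_2_1(np.column_stack([a, b]))
print(round(pearson, 3))  # 1.0   -- perfect linear agreement
print(round(icc, 3))      # 0.889 -- penalized for the constant 1-point offset
```

The Pearson correlation ignores the systematic 1-point disagreement; the intraclass correlation does not, which is why it is preferred for rater reliability.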

Reliability? Internal consistency?
- Split half: correlation between the first half of the test/instrument and the second half, or between odd-numbered and even-numbered items; should be 0.70 or above
- (Lee) Cronbach's alpha: equivalent to the average of all possible split halves of the instrument; should be 0.70 or above
- Cronbach's alpha is affected by the number of items, and although everyone reports it, they may not be sure what it actually means!
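Cronbach's alpha can be computed directly from item-level variances via the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The sketch below is a minimal illustration with made-up data (not from the talk); `cronbach_alpha` is my own helper.

```python
import numpy as np

def cronbach_alpha(items):
    """items: (n respondents) x (k items) array of scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # sample variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of respondents' totals
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 4 respondents answering 3 items
consistent = [[1, 2, 1], [2, 3, 2], [3, 4, 3], [4, 5, 4]]   # items move in lockstep
noisier    = [[1, 2, 1], [2, 3, 3], [3, 4, 2], [4, 5, 4]]   # third item is noisier

print(round(cronbach_alpha(consistent), 3))  # 1.0
print(round(cronbach_alpha(noisier), 3))     # 0.951
```

Note how even the noisier data still clears the 0.70 rule of thumb, consistent with the point on the next slide that virtually every published measure reports an adequate alpha.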

Internal consistency: alpha
"Virtually every published measure will have adequate to good internal consistency (alpha)." Miles J & Gilbert P (2005) A handbook of research methods for clinical & health psychology. Oxford. p. 100.
Choosing a test merely on the basis of Cronbach's alpha is a bit like choosing a restaurant on the basis of whether the chefs wash their hands (one can only hope that they all do so!)

Measuring test validity
Appropriateness, meaningfulness, usefulness.
- Face validity: extent to which the test appears (at face value) to measure what it is intended to measure
- Content validity: extent to which the items cover all aspects of the characteristic, e.g. depression
- Criterion validity: extent to which items correlate with a criterion measure of what the instrument supposedly measures (concurrent validity), or predict future scores/performance/diagnosis (predictive validity)
- Construct validity: constructs such as quality of life, intelligence and depression cannot be measured directly, but different measures of depression should correlate with each other (convergent validity)

Reliability/validity: how high?
In general, correlations for validity and reliability should be 0.70 or above. These values may not be provided in papers merely using an instrument; they are generally provided in papers defining the instrument (and perhaps the original scoring manual), or in papers testing the instrument in particular groups (e.g. younger or older adults, people from non-English-speaking backgrounds, etc.)

Correlation
Rule of thumb (Salkind 2013):
- 0.8 to 1.0: very strong
- 0.6 to 0.8: strong
- 0.4 to 0.6: moderate
The square of the correlation is the proportion of variation in one variable/measure/rater accounted for by variation in the other, so a correlation of 0.7 = ~50%, or half of the variation explained (the remaining variation is unexplained; the square root of that remainder is known as the coefficient of alienation).
Reliability and validity studies should simply report the size of the correlation (which should be 0.70 or above), not its statistical significance at p < 0.05 (with large enough samples, even trivial correlations can be significantly different from zero).
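The explained-variation arithmetic above can be checked in a few lines (an illustration, not from the talk):

```python
r = 0.7
r_squared = r ** 2               # proportion of variation explained
unexplained = 1 - r_squared      # proportion left unexplained
alienation = unexplained ** 0.5  # coefficient of alienation, sqrt(1 - r^2)

print(round(r_squared, 2))   # 0.49 -- a 0.7 correlation explains only about half the variation
print(round(alienation, 2))  # 0.71
```

This is why a "strong" 0.7 correlation is the usual floor for reliability and validity: anything lower leaves most of the variation unexplained.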

Quality checklist
1. Why use this particular instrument or test?
2. Are there published guidelines for the use of this test?
3. Is this instrument reliable? (0.70 or above; recent; in Australia; in subpopulations?)
4. Is this instrument valid? (0.70 or above; recent; in Australia; in subpopulations?)
5. Is this instrument cost-effective (e.g. performance vs length)?
6. What inferences/conclusions may reasonably be made from this test, and how generalizable are the findings? (e.g. older adults, hospital patients, military/veterans?)
Cohen et al (2013), pp. 126-127

Guide to (psych) tests
http://www.unl.edu/buros (charge for use, unless accessed through a library)
Corcoran K, Fischer J (2013) Measures for clinical practice and research. 5th ed. Volume 2: Adults. Oxford. e.g. 940+ pages of measures of contentment, hardiness, loneliness, mindfulness, pain (e.g. 1-to-10 scale, sad face scale, PIQ-6), etc. (Reports Cronbach's alpha internal consistency, but remember that alpha, although often used, is not very meaningful and is a function of how long the instrument is.)

Supplement tests with open-ended questions?
Since the early days of computers, open-ended questions have been regarded as problematic, although they may provide extra information and insight compared with conventional instruments, e.g. via mixed-method quantitative-qualitative approaches (e.g. NVivo/QDA Miner/ATLAS.ti) or content-analytic approaches (e.g. Leximancer/WordStat/LIWC).
Krippendorff K (2013) Content analysis. 3rd ed. Sage.
Richards L (2009) Handling qualitative data: a practical guide. 2nd ed. Sage.
Teddlie C, Tashakkori A (2009) Foundations of mixed methods research. Sage.

Be sensitive, but specific
Cut-points are generally chosen so as to maintain a balance between sensitivity and specificity (values of 0.70 or higher preferred).
- Sensitivity = probability of being identified as positive by the screening test if the person actually has the disease/diagnosis
- Specificity = probability of being identified as negative by the screening test if the person doesn't have the disorder/diagnosis
- Positive predictive value = probability that a person with a positive test result actually has the disorder
- Negative predictive value = probability that a person with a negative test result does not have the disorder
(Unlike sensitivity and specificity, the predictive values are a function of disorder prevalence.)
Harris M, Taylor G (2008) Medical statistics made easy. 2nd ed. Scion.
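All four quantities follow from a 2x2 table of screening results against true diagnostic status. The sketch below uses made-up counts (not from the talk) for a hypothetical screen of 200 people:

```python
# Hypothetical screening results against a gold-standard diagnosis
tp = 80   # test positive, has the disorder (true positives)
fn = 20   # test negative, has the disorder (false negatives)
fp = 10   # test positive, no disorder (false positives)
tn = 90   # test negative, no disorder (true negatives)

sensitivity = tp / (tp + fn)   # 80/100 = 0.80
specificity = tn / (tn + fp)   # 90/100 = 0.90
ppv = tp / (tp + fp)           # 80/90  ~ 0.89, depends on prevalence
npv = tn / (tn + fn)           # 90/110 ~ 0.82, depends on prevalence

print(sensitivity, specificity, round(ppv, 2), round(npv, 2))
```

Rerunning this with a much rarer disorder (fewer true cases, same sensitivity and specificity) would show the positive predictive value falling sharply, which is the prevalence dependence noted above.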

Future directions?
Techniques for the analysis of instruments are constantly being developed, but the design of instruments themselves hasn't changed much in 50 years or more, i.e. pen and paper or simple computerized versions thereof. There is lots of scope for new instruments and new developments.
Bartram D, Hambleton RK (2006) Computer-based testing and the internet: issues and advances. Wiley.
Wagner-Menghin, Masters GN (2013) Adaptive testing for psychological assessment. Journal of Applied Measurement, 14, 106-117.

Conclusion
Assess the reliability (consistency) and validity (appropriateness, meaningfulness) of the instrument. Consider who it has been applied to, with further reading as required. And, in terms of the empirical evidence as well as your clinical judgement: is the instrument any good? Is it the right horse for the course?

Further/future reading
- Cohen RJ, Swerdlik ME, Sturman ED (2013) Psychological testing and assessment: an introduction to tests and measurement. 8th ed. McGraw-Hill.
- Garson GD (2013) Validity and reliability. (Free ebook from www.statisticalassociates.com, or US$5 in Kindle format from www.amazon.com)
- Harvey J, Taylor V (2013) Measuring health and wellbeing. Sage.
- Salkind N (2013) Tests & measurement for people who (think they) hate tests & measurement. 2nd ed. Sage.
- Streiner DL, Norman GR (2008) Health measurement scales: a practical guide to their development and use. 4th ed. Oxford.