How Good Is the Instrument?
Dr Dean McKenzie BA(Hons) (Psychology), PhD (Psych Epidemiology), Senior Research Fellow
(Abridged version; full version to be presented July 2014)
Goals
- To briefly summarize the basic tools and concepts needed to answer the question: is this instrument any good?
- Mainly psych instrument examples (my background), but the concepts also apply to instruments from other health areas
Psycho-Statistical Inventions
- Cohen's effect size / kappa / power analysis
- Cronbach's alpha (internal consistency)
- Factor analysis (1904)
- Item response theory, Rasch modelling, partial credit
- Likert scale, semantic differential
- McNemar's test
- Spearman correlation
- Test reliability
- Types of data: nominal, ordinal, interval, ratio (Stanley Smith Stevens, Science 1946, 103: 677-680)
(List partly based on Cohen et al (2013))
Numbers into Information: Florence Nightingale
- First female member of the Royal Statistical Society
- Devised the coxcomb, or polar area diagram
Rehmeyer J (2008) Florence Nightingale: the passionate statistician. ScienceNews, 26 November 2008.
A Typical Instrument: the Kessler Psychological Distress Scale (K10)
- Developed in the US by Ronald Kessler et al, tested in Australia, employed in the Victorian Population Health Survey, Australian National Health Surveys etc
- 10 questions such as "in the past 4 weeks, about how often did you feel worthless" and "in the past 4 weeks, about how often did you feel nervous"
- Similar in scope to David Goldberg's GHQ (General Health Questionnaire, 1972), but the K10 is public domain (the GHQ is licensed by NFER-Nelson)
Andrews G, Slade T (2001) Australian and New Zealand Journal of Public Health, 25, 6: 494-497.
http://www.blackdoginstitute.org.au/docs/5.k10withinstructions.pdf
Determining Instrument/Test Quality
Is this instrument any good?
- Reliability: does it give consistent results, across different raters / situations?
- Validity: does it measure what it is supposed to?
- Does the instrument measure what it is supposed to, consistently and well, in the population(s) it is applied to, here and now?
Test Reliability = Repeatability or Consistency
- Test-retest: should achieve the same scores each time (if testing characteristics that are stable across time, such as intelligence or personality, rather than less stable characteristics, such as mood or pain levels; but even then, stability may be an unrealistic assumption, Cohen et al 2013)
- Inter-rater: different interviewers / raters should reach the same conclusions / score levels
Measuring Test Reliability
- Measured by intraclass* correlation; should be 0.70 or above (1.0 is the maximum)
- e.g. if rater A's scores are 1, 3, 5 and rater B's scores are 2, 4, 6, the (Karl) Pearson correlation would be 1.0 (perfect), but the raters never actually assign the same score!
- The intraclass correlation is often used for reliability, and (Jacob) Cohen's kappa* (chance-corrected) for binary/categorical measures
(*please see Streiner & Norman 2008 for more details)
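The rater example above can be worked through in code. A minimal sketch (not from the slides): the two raters differ by a constant one point on every subject, so the Pearson correlation is a "perfect" 1.0, while a two-way absolute-agreement intraclass correlation (ICC(A,1) in McGraw & Wong's notation) penalises the systematic offset.

```python
def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def icc_agreement(a, b):
    """ICC(A,1): two-way model, absolute agreement, single rater,
    for two raters scoring the same subjects."""
    n, k = len(a), 2
    grand = (sum(a) + sum(b)) / (n * k)
    row_means = [(x + y) / k for x, y in zip(a, b)]   # per-subject means
    col_means = [sum(a) / n, sum(b) / n]              # per-rater means
    msr = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    msc = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    sst = sum((x - grand) ** 2 for x in a + b)
    mse = (sst - (n - 1) * msr - (k - 1) * msc) / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

rater_a = [1, 3, 5]
rater_b = [2, 4, 6]
print(pearson(rater_a, rater_b))        # 1.0 -- looks "perfect"
print(icc_agreement(rater_a, rater_b))  # ~0.89 -- the offset is penalised
```

The ICC here still looks respectable (about 0.89) because the offset is small and consistent; a larger or less systematic disagreement would pull it down further, while the Pearson correlation would stay blind to any constant shift.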
Reliability? Internal Consistency?
- Split half: correlation between the first half of the test/instrument and the second half, or between odd-numbered and even-numbered items; should be 0.70 or above
- (Lee) Cronbach's alpha: equivalent to the average of all possible split halves of the instrument; should be 0.70 or above
- Alpha increases with the number of items, and although everyone reports it, few may be sure what it actually means!
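Cronbach's alpha itself is simple to compute from item-level scores. A minimal sketch with made-up data (not from the slides), using the standard formula alpha = k/(k-1) × (1 − Σ item variances / variance of totals):

```python
def pvar(xs):
    """Population variance of a list of scores."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(items):
    """items: one list of scores per item, each of length n_respondents."""
    k = len(items)
    totals = [sum(col) for col in zip(*items)]  # each respondent's total
    return k / (k - 1) * (1 - sum(pvar(it) for it in items) / pvar(totals))

# Three perfectly parallel items: alpha is 1.0
print(cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]))

# Two items that only partly agree: alpha drops to 0.75
print(cronbach_alpha([[1, 2, 3, 4], [2, 1, 4, 3]]))
```

Note how the first example earns a "perfect" alpha simply because the items are duplicates of each other, which illustrates the slide's caution: a high alpha can reflect redundancy (or sheer length) as much as a well-constructed scale.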
Internal Consistency: Alpha
"Virtually every published measure will have adequate to good internal consistency (alpha)" Miles J & Gilbert P (2005) A handbook of research methods for clinical & health psychology. Oxford. p. 100.
Choosing a test merely on the basis of Cronbach's alpha is a bit like choosing a restaurant on the basis of whether the chefs wash their hands (we can only hope that they all do so!)
Measuring Test Validity
Appropriateness, meaningfulness, usefulness
- Face validity: extent to which the test appears (at face value) to measure what it is intended to measure
- Content validity: extent to which the items reflect all aspects of the characteristic, e.g. depression
- Criterion validity: extent to which items correlate with a criterion measure of what the instrument supposedly measures, e.g. depression (concurrent validity), or predict future scores / performance / diagnosis (predictive validity)
- Construct validity: concepts such as quality of life, intelligence and depression cannot be measured directly, but different measures of depression should correlate with each other (convergent validity)
Reliability/Validity: How High?
- In general, correlations for validity and reliability should be 0.70 or above
- Reliability and validity values may not be provided in papers merely using an instrument
- They are generally provided in papers defining the instrument (and perhaps the original scoring manual) or testing it in particular groups (e.g. younger or older adults, people from non-English speaking backgrounds etc)
Correlation
Rule of thumb (Salkind 2013): 0.8 to 1.0 very strong; 0.6 to 0.8 strong; 0.4 to 0.6 moderate
- The square of the correlation is the % of variation in one variable/measure/rater accounted for by variation in the other, so a correlation of 0.7 = ~50%, or half, of the variation explained (the remaining variation is unexplained; the square root of that unexplained proportion is known as the "coefficient of alienation")
- When reporting reliability and validity, report the size of the correlation (which should be 0.70 or above), not just its statistical significance (p < 0.05): with large enough samples, even trivial correlations can be significantly different from zero
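Both points above can be checked with a few lines of arithmetic. A sketch with illustrative numbers (the 0.70 threshold is from the slides; the r = 0.05, n = 10,000 example is hypothetical), using the usual t statistic for testing a correlation against zero, t = r√((n−2)/(1−r²)):

```python
def variance_explained(r):
    """Proportion of variation in one measure accounted for by the other."""
    return r ** 2

def t_statistic(r, n):
    """t for testing H0: rho = 0, given correlation r from n pairs."""
    return r * ((n - 2) / (1 - r ** 2)) ** 0.5

print(variance_explained(0.7))    # ~0.49: about half the variation explained
print(t_statistic(0.05, 10_000))  # ~5.0: far above 1.96, so p < 0.05
```

So a correlation of 0.05, which explains a quarter of one percent of the variation and is practically worthless, is nonetheless "highly significant" with 10,000 cases, which is exactly why the size of the correlation, not the p-value, is what matters for reliability and validity.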
Quality Checklist
1. Why use this particular instrument or test?
2. Are there published guidelines for the use of this test?
3. Is this instrument reliable? (0.70 or above, recent, Australia, subpopulations?)
4. Is this instrument valid? (0.70 or above, recent, Australia, subpopulations?)
5. Is this instrument cost-effective (e.g. performance vs length)?
6. What inferences/conclusions may reasonably be made from this test, and how generalizable are the findings? (e.g. older adults, hospital patients, military / veterans?)
Cohen et al (2013), pp. 126-127
Guide to (Psych) Tests
- http://www.unl.edu/buros (charge for use, unless accessed through a library)
- Corcoran K, Fischer J (2013) Measures for clinical practice and research. 5th ed. Volume 2: Adults. Oxford. e.g. 940+ pages of measures of contentment, hardiness, loneliness, mindfulness, pain (e.g. 1 to 10 scale, sad face scale, PIQ-6) etc
- (Reports Cronbach's alpha internal consistency, but remember that alpha, though often used, is not very meaningful and is a function of how long the instrument is)
Supplement Tests with Open-Ended Questions?
- Since the early days of computers, open-ended questions have been regarded as problematic, although they may provide extra information and insight compared with conventional instruments
- e.g. mixed method quantitative-qualitative (e.g. NVivo / QDA Miner / ATLAS.ti) or content analytic (e.g. Leximancer / WordStat / LIWC) approaches
Krippendorff K (2013) Content analysis. 3rd ed. Sage.
Richards L (2009) Handling qualitative data: a practical guide. 2nd ed. Sage.
Teddlie C, Tashakkori A (2009) Foundations of mixed methods research. Sage.
Be Sensitive, but Specific
- Cut-points are generally chosen so as to maintain a balance between sensitivity and specificity
- Sensitivity = probability of being identified as positive by the screening test if the person actually has the disease / diagnosis
- Specificity = probability of being identified as negative by the screening test if the person doesn't have the disorder / diagnosis (values of 0.70 or higher preferred)
- Positive predictive value = probability of a person with a positive test result actually having the disorder
- Negative predictive value = probability of a person with a negative test result not actually having the disorder (unlike sensitivity and specificity, the predictive values are a function of disorder prevalence)
Harris M, Taylor G (2008) Medical statistics made easy. 2nd ed. Scion.
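These four quantities fall straight out of a 2x2 table of true/false positives and negatives. A sketch with hypothetical counts (not from the slides) that also shows why predictive values, unlike sensitivity and specificity, depend on prevalence:

```python
def screening_stats(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV and NPV from true positives (tp),
    false positives (fp), false negatives (fn) and true negatives (tn)."""
    return {
        "sensitivity": tp / (tp + fn),  # positive test | has disorder
        "specificity": tn / (tn + fp),  # negative test | no disorder
        "ppv": tp / (tp + fp),          # has disorder | positive test
        "npv": tn / (tn + fn),          # no disorder | negative test
    }

# 100 cases vs 100 non-cases: sensitivity 0.80, specificity 0.90
balanced = screening_stats(tp=80, fp=10, fn=20, tn=90)
print(balanced["ppv"])  # ~0.89

# Same test, rarer disorder (100 cases vs 1000 non-cases): sensitivity
# and specificity are unchanged, but the PPV falls sharply.
rare = screening_stats(tp=80, fp=100, fn=20, tn=900)
print(rare["ppv"])  # ~0.44
```

In the rarer-disorder scenario, fewer than half of the people who screen positive actually have the disorder, even though the test's sensitivity and specificity have not changed at all, which is why a screening cut-point that works in a clinic can disappoint in a general population survey.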
Future Directions?
- Techniques for the analysis of instruments are constantly being developed, but the design of instruments themselves hasn't changed much in 50 years or more, i.e. pen and paper or simple computerized versions thereof
- Lots of scope for new instruments / new developments
Bartram D, Hambleton RK (2006) Computer-based testing and the internet: issues and advances. Wiley.
Wagner-Menghin, Masters GN (2013) Adaptive testing for psychological assessment. Journal of Applied Measurement, 14, 106-117.
Conclusion
- Assess the reliability (consistency) and validity (appropriateness, meaningfulness)
- Who has it been applied to? Do further reading as required
- In terms of the empirical evidence, as well as your clinical judgement, is the instrument any good? Is it the right horse for the course?
Further / Future Reading
Cohen RJ, Swerdlik ME, Sturman ED (2013) Psychological testing and assessment: an introduction to tests and measurement. 8th ed. McGraw-Hill.
Garson GD (2013) Validity and reliability. (Free ebook from www.statisticalassociates.com, or $US5 in Kindle format from www.amazon.com)
Harvey J, Taylor V (2013) Measuring health and wellbeing. Sage.
Salkind N (2013) Tests & measurement for people who (think they) hate tests & measurement. 2nd ed. Sage.
Streiner DL, Norman GR (2008) Health measurement scales: a practical guide to their development and use. 4th ed. Oxford.