Basic Psychometrics for the Practicing Psychologist Presented by Yossef S. Ben-Porath, PhD, ABPP


2017 ABPP Annual Conference & Workshops, San Diego, CA, May 18, 2017

Yossef S. Ben-Porath, Kent State University, ybenpora@kent.edu

Learning Objectives

Attendees will be able to:
- Conceptualize and explain correlational statistics
- Evaluate the strengths and weaknesses of various methods for estimating reliability
- Use reliability estimates to calculate and use confidence intervals and standard errors of measurement
- Evaluate the validity of psychological test scores used in psychological assessments

Correlational Statistics

Correlation

A researcher seeks to develop a psychological test to assist in identifying individuals with good leadership ability. He or she has studied the scientific literature on what makes good leaders and has concluded that dominance, the ability to influence others through interpersonal interactions, plays an important role in leadership. To explore this further, correlational analyses will be used to establish the association between dominance test scores and leadership ratings.

A sample of individuals is administered a psychological test measuring dominance (variable X) and receives leadership ratings (variable Y) from peers who know them well. These two procedures yield raw scores for each participant on variables X and Y. Individual raw scores on psychological measures typically have no intrinsic meaning; they must be converted into a meaningful, standard metric, that is, they must be standardized.

Correlation

Deviation Scores reflect an individual's standing on a variable in reference to a group mean (M):

x = X − M_X, where M_X = ΣX / N

M_X = mean of variable X scores
Σ = sum of
X = raw score
N = number of participants (i.e., sample size)
x = deviation score on X

Like raw scores, deviation scores have no intrinsic meaning (though they are a bit more informative) because they are still expressed on the same arbitrary metric as the raw score.

Standard Scores reflect an individual's standing on a variable in reference to their group mean and its associated standard deviation:

SD_X = √(Σx² / (N − 1)) and Z_X = x / SD_X

SD_X = standard deviation of X
Z_X = standard Z score for X

Standard Scores are expressed on a uniform metric and can be compared across variables. Z scores are standard scores that have a mean of zero and a standard deviation of 1.

Some Standard Score Equivalents (chart slide)
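To make the standardization steps concrete, here is a minimal Python sketch (illustrative, not part of the original handout; the sample data are made up) implementing the deviation-score and Z-score formulas above:

```python
def z_scores(raw):
    """Convert raw scores to Z scores: Z = (X - M_X) / SD_X,
    with SD_X = sqrt(sum(x^2) / (N - 1))."""
    n = len(raw)
    mean = sum(raw) / n                                 # M_X
    devs = [x - mean for x in raw]                      # deviation scores x = X - M_X
    sd = (sum(d * d for d in devs) / (n - 1)) ** 0.5    # SD_X
    return [d / sd for d in devs]

print([round(z, 2) for z in z_scores([2, 4, 4, 5, 7, 8])])
# Z scores have mean 0 and SD 1, so they can be compared across variables.
```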

Correlation

The Pearson Product Moment Correlation (r) represents the extent to which standardized deviations from the mean on one variable (X) are associated with standardized deviations from the mean on a second variable (Y):

r_XY = Σ(Z_X · Z_Y) / (N − 1)

r_XY = correlation between variables X and Y: the sum of the products of the Z scores, divided by N − 1

Scatterplots & Correlations for Positive Values of r (slide figures)

Scatterplots & Correlations for Negative Values of r (slide figures)

Correlation

Measures with limited variability (i.e., not much deviation from the mean) cannot be correlated strongly with other measures. Extreme example where both variables have no variance:

Subject   X    Y
1         40   55
2         40   55
3         40   55
4         40   55
5         40   55
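The Pearson r defined above is just the mean cross-product of Z scores, with N − 1 in the denominator. A short sketch (illustrative, with made-up data; stdev uses the same N − 1 formula as above):

```python
from statistics import stdev  # sample SD (N - 1 denominator)

def pearson_r(xs, ys):
    """r_XY = sum(Z_X * Z_Y) / (N - 1). If either variable has no
    variance, its SD is zero and r is undefined (division by zero),
    which is the point of the zero-variance example above."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx, sy = stdev(xs), stdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)

print(round(pearson_r([2, 4, 4, 5, 7, 8], [3, 5, 4, 6, 6, 9]), 2))  # 0.92 for this made-up sample
```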

Correlation

With no variance, every deviation score is zero, the standard deviation is zero, and the Z scores are undefined, so r = Σ(Z_X · Z_Y) / (N − 1) cannot be computed.

Example where only one variable has no variance:

Subject   X    Y
1         40   50
2         40   55
3         40   52
4         40   65
5         40   44

Here, too, Z_X is undefined, so r cannot be computed.

Scatterplots Illustrating Unrestricted and Restricted Ranges (slide figures)

Range Restriction

Scattergram of adult males and NBA players, illustrating the effect of restriction of range on the correlation of height and weight (slide figure).

Limited variance (a.k.a. range restriction) may be a function of the variable itself, in which case the variable is not going to be very useful as a predictor of any other variable. Or it may be a function of restricted sampling, in which case the data will underestimate the true association between the variables. As just noted, range restriction in just one variable will lower the observed correlation between the two.
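Range restriction is easy to demonstrate by simulation. The sketch below (assumed parameters, not data from the talk) generates correlated scores, then keeps only cases above the mean on X, mimicking a selected sample; the observed r drops well below the full-range value:

```python
import random
from statistics import correlation  # Python 3.10+

random.seed(0)
xs, ys = [], []
for _ in range(5000):
    x = random.gauss(0, 1)
    xs.append(x)
    ys.append(0.6 * x + random.gauss(0, 0.8))  # population r is about .60

print(f"full range:       r = {correlation(xs, ys):.2f}")

# Restrict the range: keep only above-average X, as in a selected sample.
kept = [(x, y) for x, y in zip(xs, ys) if x > 0]
print(f"restricted range: r = {correlation([x for x, _ in kept], [y for _, y in kept]):.2f}")
```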

Range Restriction

When might range restriction affect psychology research findings?

- Restricted sampling: use of convenience samples (e.g., college students, prison inmates) to study psychopathology (e.g., the association between psychopathy and substance abuse). Range restriction will produce an underestimate of the true association.
- Low base rate phenomena (e.g., suicide) are difficult to predict because of range restriction.

Regression

In assessment, we typically use correlations between variables to predict an individual's standing on one variable (leadership) as a function of their standing on another (dominance). This can be accomplished by using regression models that take the basic form:

Ŷ = bX + a

where:
Ŷ = predicted (leadership) score
b = slope of the regression line = r · (SD_Y / SD_X)
X = raw test (dominance) score
a = Y intercept = M_Y − b · M_X

Numerical Example

Dominance (X)   Leadership (Y)   Z_X     Z_Y     Z_X · Z_Y
10              8                 1.57    1.26    1.98
4               3                -0.57   -1.60    0.91
5               7                -0.22    0.69   -0.15
6               7                 0.14    0.69    0.10
2               4                -1.29   -1.03    1.33
5               5                -0.22   -0.46    0.10
9               6                 1.22    0.11    0.13
8               6                 0.86    0.11    0.09
3               4                -0.93   -1.03    0.96
4               8                -0.57    1.26   -0.72

M_X = 5.6, M_Y = 5.8, SD_X = 2.79, SD_Y = 1.75

r = Σ(Z_X · Z_Y) / (N − 1) = 0.53
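Using only the summary statistics from the numerical example (M_X = 5.6, M_Y = 5.8, SD_X = 2.79, SD_Y = 1.75, r = .53), a quick sketch of the regression line and a prediction:

```python
m_x, m_y = 5.6, 5.8
sd_x, sd_y = 2.79, 1.75
r = 0.53

b = r * sd_y / sd_x   # slope: b = r * (SD_Y / SD_X), about 0.33
a = m_y - b * m_x     # intercept: a = M_Y - b * M_X, about 3.94

def predict_leadership(dominance):
    """Y-hat = b * X + a."""
    return b * dominance + a

print(f"b = {b:.2f}, a = {a:.2f}, predicted leadership for X = 10: {predict_leadership(10):.2f}")
```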

Regression and Range Restriction

Ŷ = bX + a

where:
Ŷ = predicted (leadership) score
b = slope of the regression line = r · (SD_Y / SD_X)
X = raw test (dominance) score
a = Y intercept = M_Y − b · M_X

(Slide figures plot leadership against dominance for unrestricted and range-restricted samples.)

Implication: range-restricted correlations (r) result in greater prediction error.

Regression and Range Restriction

Implications:
- Range-restricted correlations (r) result in greater prediction error.
- If SD_Y and SD_X differ across samples or settings, prediction errors will vary as well.

Reliability and Measurement Error

Reliability:
- Definitions: conceptual, psychometric
- Ways of estimating
- Applications: estimating true scores, establishing confidence intervals

Reliability Defined Conceptually

The reliability of a test score reflects the precision with which it represents an individual's standing on the test. If we administer a WAIS-IV, how accurate is the resulting IQ score as an indication of the individual's actual IQ? If Mr. Jones obtains an IQ score of 120, how much confidence do we have that his actual IQ (not his intelligence!) is 120?

Reliability Defined Psychometrically

In Classical Test Theory:

X = T + e

where:
X = Observed Score (the test score)
T = True Score (the individual's actual score if measured without error)
e = random measurement error

Two Types of Measurement Error

Random measurement error results from unsystematic factors that contribute to the observed score:
- Test-taker: fatigue, distractibility, mood
- Testing conditions: lighting, noise, variability in the examiner

By definition, random measurement error cannot be correlated with any other variable; if it correlates with another variable, it is not unsystematic.

Systematic error affects the True Score and, by definition, is not random. It affects test score validity rather than reliability. Examples of systematic error: items that don't measure the intended construct, reactivity, bias. Systematic error can correlate with other variables.

Reliability Defined Psychometrically

Every time we conduct a measurement, we encounter a different amount of random error:

X_1 = T + e_1
X_2 = T + e_2
X_3 = T + e_3
...
X_k = T + e_k

T remains constant, whereas X_i varies as a function of e_i. Changes in X are a product of random error.

Reliability Defined Psychometrically

Random measurement error will sometimes push the observed score above the true score, and other times it will pull the observed score below the true score. The mean of a hypothetical, infinite number of measurements is equal to the True Score:

T = ΣX_i / k, as the number of measurements k grows without bound

This method for determining the true score is impossible to implement.

We can, alternatively, estimate the correlation between observed and true scores, r_XT:
- If r_XT = 1.0, then X = T and there is no error.
- Random error is reflected in the extent to which r_XT < 1.0.
- Expressed in terms of standard deviations from the mean, r_XT = σ_T / σ_X; if σ_X > σ_T, then r_XT < 1.0.

Reliability Defined Psychometrically

The square of a correlation indicates the proportion of variance shared by the two variables. The squared correlation between the Observed and True scores reflects the proportion of true score variance in Observed Score variance. This is the psychometric definition of reliability:

r_XX = r_XT² = σ_T² / σ_X²

Reliability is the ratio of True Score variance to Observed Score variance.

Ways of Estimating Reliability

Because we don't know the actual true score, we can only estimate reliability by indirect means:
- Test-Retest
- Alternate (Equivalent) Forms
- Internal Consistency
- Inter-Scorer/Inter-Rater

Test-Retest Reliability

A test is administered to a group of people twice, and scores are correlated across Time 1 and Time 2:

X_1 = T_1 + e_1
X_2 = T_2 + e_2

Because they are random variables, e_1 and e_2 cannot be correlated with each other or with T_1, T_2, X_1, or X_2. It is assumed that T_1 = T_2. It can be derived that r_X1X2 = σ_T² / σ_X² = r_XX.

Test-Retest Reliability Shortcomings

Reasons why T_1 may not equal T_2: temporal instability and repeated measurement. If T_1 ≠ T_2, then r_X1X2 will underestimate reliability by attributing changes from T_1 to T_2 to measurement error. The more unstable the measured construct, and the longer the interval between measurements, the more likely this is to occur. Example: anger.
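The derivation r_X1X2 = σ_T² / σ_X² can be checked with a small simulation (a sketch with assumed variances, not from the handout): the same true scores are measured twice with independent random errors, and the test-retest correlation approximates the true-score variance ratio.

```python
import random
from statistics import correlation, variance  # Python 3.10+

random.seed(1)
t = [random.gauss(100, 12) for _ in range(20000)]   # true scores T, variance ~144
x1 = [ti + random.gauss(0, 6) for ti in t]          # X1 = T + e1, error variance ~36
x2 = [ti + random.gauss(0, 6) for ti in t]          # X2 = T + e2, independent of e1

print(f"theoretical r_xx = var(T)/var(X) = {144 / (144 + 36):.2f}")  # .80
print(f"observed r(X1, X2)              = {correlation(x1, x2):.2f}")
print(f"observed var(T)/var(X1)         = {variance(t) / variance(x1):.2f}")
```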

Alternate Forms Reliability

Rather than administer the same test (X) twice, an alternate form (Y), assumed to be equivalent to X, is administered, and scores are correlated across forms administered at the same time:

X_1 = T_1 + e_1
Y_1 = T_2 + e_2

Because they are random variables, e_1 and e_2 cannot be correlated with each other or with T_1, T_2, X_1, or Y_1. It is assumed that T_1 = T_2. It can be derived that r_X1Y1 = σ_T² / σ_X² = r_XX.

Alternate Forms Reliability Shortcomings

Reasons why T_1 may not equal T_2: the forms may not be truly equivalent measures, and there is repeated measurement. If T_1 ≠ T_2, then r_X1Y1 will underestimate reliability by attributing changes from T_1 to T_2 to measurement error. The more difficult it is to construct truly equivalent forms, the more likely this is to occur; it is more likely to be problematic when measuring personality or psychopathology than cognitive functioning.

Internal Consistency Reliability

Rather than construct alternate forms, we treat each item on a scale as an alternate measure of T. We calculate correlations between all possible pairs of items and create a variable that represents the average correlation between items. We then adjust that average correlation to account for the fact that each observation is based on a correlation between single items, whereas the test score represents a composite of all of the test items (i.e., multiple measurements).

Recall that T is the mean of a hypothetical, infinite number of measurements (T = ΣX_i / k); therefore, the more measurements we take, the closer their average is likely to be to the True Score. When we create a composite Observed Score based on multiple measurements (items), we enhance the Observed Score's reliability.

Internal Consistency Reliability

The relation between the number of items on a scale and the composite score's reliability is reflected in the Spearman-Brown formula (see the sketch below):

r_kk = (k · r_xx) / (1 + (k − 1) · r_xx)

where:
r_kk = the corrected reliability estimate
k = the factor by which the number of items is increased
r_xx = the uncorrected reliability estimate

This assumes that the additional items are equivalent measures of T.

For example: a five-item scale has been found to have an internal consistency of .70, and an investigator wants to estimate what the reliability would be if five more items were added (i.e., k · 5 = 10, so k = 2):

r_kk = (2 · .7) / (1 + (2 − 1) · .7) = 1.4 / 1.7 = .82

Cronbach's Coefficient Alpha (α) is the most commonly used index of internal consistency. It estimates a test's reliability based on the average correlation between all possible pairs of items on a scale, corrected for the fact that these correlations are based on single measurements (correlations between single items). One way to increase reliability is by adding items: short tests tend to be less reliable, and single items are least reliable.

Inter-Scorer/Rater Reliability

Some psychological measures require that scorers (or raters) exercise judgment. Examples: scoring Rorschach variables, structured diagnostic interviewing, and coding behavioral observations. Random differences among scorers/raters contribute random error to the Observed Score, resulting in lower reliability.
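The Spearman-Brown adjustment is a one-liner; this sketch reproduces the slide's five-to-ten-item example:

```python
def spearman_brown(r_xx, k):
    """Corrected reliability when test length changes by factor k:
    r_kk = (k * r_xx) / (1 + (k - 1) * r_xx).
    Assumes the added items are equivalent measures of the same true score."""
    return (k * r_xx) / (1 + (k - 1) * r_xx)

print(round(spearman_brown(0.70, 2), 2))    # doubling a .70 scale gives .82
print(round(spearman_brown(0.70, 0.5), 2))  # halving it drops reliability to .54
```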

Inter-Scorer/Rater Reliability

Inter-scorer/rater reliability is determined by examining correlations between test scores across scorers/raters. Examples:
- A set of Rorschach protocols is scored by two different individuals
- Two interviewers code the same set of responses to a structured diagnostic interview
- Two raters code the same video-recorded sample of behavior

Note that the same set of data is scored in each instance. If, for example, two psychologists administered and scored Rorschachs separately, we could be confounding test-retest and inter-scorer reliability, as well as any interaction effects between the examiner and the test taker.

When two or more individuals actually generate the data to be scored (e.g., two separate structured psychiatric interviews are conducted and scored by two independent interviewers), we evaluate inter-examination reliability. The same techniques are used; however, the results may be interpreted differently because of the added effects of repeated measurement.

Estimating Reliability

All four methods for estimating reliability (test-retest, alternate forms, internal consistency, and inter-scorer or inter-rater) are just that: estimates. For various reasons already discussed, they are usually under-estimates of actual reliability. In addition, if we use samples with restricted ranges on the variables of interest (for example, a non-clinical sample to estimate the reliability of a psychopathology measure), this too will yield an under-estimate of reliability.

Applications of Reliability

- Estimating true scores
- Establishing confidence intervals and the significance of the difference between test scores

Estimating True Scores

With knowledge of the following:
- an individual's Observed Score,
- the reliability of that score, and
- the mean of that score in that individual's population,

it is possible to estimate his or her true score on the construct being measured:

T′ = (1 − r_xx) · M_X + r_xx · X

where:
T′ = estimated True Score
r_xx = the test's reliability
M_X = the population's mean observed score on X
X = the individual's raw score on X

Estimating True Scores

T′ = (1 − r_xx) · M_X + r_xx · X

Note that:
- If r_xx = 1, then T′ = (1 − 1) · M_X + 1 · X = X (i.e., if a test is perfectly reliable, the estimated true score actually equals the observed score).
- If r_xx = 0, then T′ = (1 − 0) · M_X + 0 · X = M_X (i.e., if a test is perfectly unreliable, then our best estimate of the individual's true score, the one that is least likely to be wrong, is the population mean).

As reliability decreases, the predicted True Score approaches the population mean. This is known as regression to the mean due to measurement error.

An individual obtains an IQ score of 120:
- If reliability is .90: T′ = (1 − .9) · 100 + .9 · 120 = 118
- If reliability is .80: T′ = (1 − .8) · 100 + .8 · 120 = 116
- If reliability is .70: T′ = (1 − .7) · 100 + .7 · 120 = 114
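A direct sketch of the estimated-true-score formula, reproducing the IQ examples above (population mean of 100 assumed):

```python
def estimated_true_score(x, r_xx, m_x=100.0):
    """T' = (1 - r_xx) * M_X + r_xx * X: the observed score is regressed
    toward the population mean in proportion to the test's unreliability."""
    return (1 - r_xx) * m_x + r_xx * x

for r in (0.90, 0.80, 0.70):
    print(f"r_xx = {r:.2f}: T' = {estimated_true_score(120, r):.0f}")
# prints 118, 116, and 114, as in the slide
```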

Establishing Confidence Intervals

The estimated True Score represents the most accurate estimate of the actual True Score, as constrained by the reliability of a measure. As a measure's reliability approaches 1.0, our estimate of the True Score is more likely to be accurate.

With knowledge of the following:
- an individual's estimated True Score,
- the reliability of that score, AND
- the standard deviation of that score in the individual's population,

it is possible to identify a range of scores within which that individual's True Score is likely to fall at a given probability level. This requires calculating the standard error of measurement.

Standard Error of Measurement

SE_M = SD_X · √(1 − r_xx)

SE_M = standard error of measurement
SD_X = standard deviation of X
r_xx = reliability of X

For an IQ test with SD_X = 15:
- If r_xx = .90: SE_M = 15 · .316 = 4.74
- If r_xx = .80: SE_M = 15 · .447 = 6.71
- If r_xx = .70: SE_M = 15 · .548 = 8.22

Confidence Intervals

The Standard Error of Measurement can be used to place confidence intervals around the estimated True Score. Based on the characteristics of the normal curve, we can determine that there is a 68% probability that the actual True Score will fall within one SE_M above or below the estimated True Score, and a 95% probability that it will fall within two SE_Ms of this score.

(Slide figure: area under the normal curve, marked from −2 SE_M to +2 SE_M around T′.)

If an individual obtains an IQ score of 120 on a measure with estimated .90 reliability, then:
- Estimated True Score = 118
- 68% confidence interval: 118 ± 4.74 (113-123)
- 95% confidence interval: 118 ± 9.48 (109-127)

Note that confidence intervals range symmetrically about the estimated True Score (118), NOT the Observed Score (120).

If an individual obtains an IQ score of 120 on a measure with estimated .80 reliability, then:
- Estimated True Score = 116
- 68% confidence interval: 116 ± 6.71 (109-123)
- 95% confidence interval: 116 ± 13.42 (103-129)

If an individual obtains an IQ score of 120 on a measure with estimated .70 reliability, then:
- Estimated True Score = 114
- 68% confidence interval: 114 ± 8.22 (106-122)
- 95% confidence interval: 114 ± 16.44 (98-130)
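Putting the pieces together, a sketch that computes the SE_M and a confidence interval centered on the estimated True Score (an SD of 15 and mean of 100 assumed, as for IQ):

```python
def sem(sd_x, r_xx):
    """Standard error of measurement: SE_M = SD_X * sqrt(1 - r_xx)."""
    return sd_x * (1 - r_xx) ** 0.5

def confidence_interval(x, r_xx, m_x=100.0, sd_x=15.0, n_sem=2):
    """Interval of n_sem SE_Ms around T' (NOT around the observed score)."""
    t_prime = (1 - r_xx) * m_x + r_xx * x
    half = n_sem * sem(sd_x, r_xx)
    return t_prime - half, t_prime + half

lo, hi = confidence_interval(120, 0.90)   # 95% interval, 2 SE_Ms
print(f"95% CI: {lo:.0f} to {hi:.0f}")    # about 109 to 127, as in the slide
```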

Confidence Intervals

If an individual obtains an IQ score of 75 on a measure with estimated .80 reliability, then:
- T′ = (1 − .80) · 100 + .80 · 75 = 20 + 60 = 80
- Estimated True Score = 80
- 68% confidence interval: 80 ± 6.71 (73-87)
- 95% confidence interval: 80 ± 13.42 (67-93)

Hall v. Florida, 134 S. Ct. 1986 (2014): The U.S. Supreme Court held that the SEM must be recognized when assessing intellectual disability and found Florida's bright-line statute unconstitutional.

Validity

Validity of a test score reflects the extent to which it represents an individual's standing on the construct of interest. Reliability of a test score reflects its precision. Reliability sets the upper limit for validity but has no bearing on its lower limit.

Validity

Unlike reliability, establishing information about test-score validity is an ongoing process. Validation is the ongoing process of accumulating information about the meaning of test scores. Sound psychological testing and assessment practice requires that test users remain current on the products of ongoing test validation research.

Validity is not a dichotomous property of test scores. Rather, test scores may be more or less valid depending upon their intended use AND the population with which the test is used.

Examples:
- Intended use: IQ test scores may help identify children with special educational needs; however, they are not good predictors of special behavioral needs.
- Population: An adult IQ test will not be very helpful in identifying a 10-year-old's special educational needs.

Three primary sources may provide test score validity information:
a. the test's content
b. the test scores' empirical associations with other variables
c. the relation of a and b to theory about the construct being measured

These are also known, respectively, as Content, Criterion, and Construct Validity.

Content Validity

Content validity reflects the extent to which test items canvass the relevant content domain. Being able to define the relevant content domain is a crucial step in the process of establishing content validity. For example, a structured psychiatric interview designed to diagnose schizophrenia could be examined to determine the extent to which it includes questions that address all of the diagnostic criteria for the disorder.

Content validity is not a static property of test scores. As our understanding or definition of a construct's relevant content domain changes, so does the content validity of its measures. For example, if we learn more about manifestations of PTSD or psychopathy, content-based measures of these constructs need to be updated.

Content validity can be sufficient, but is not necessary, for test scores to be valid:
- Sufficient: when test score interpretation is content-based, adequate representation of the content domain is sufficient to support test score validity. If a reliable structured psychiatric interview adequately canvasses the current diagnostic criteria, a positive test score validly indicates the diagnosis.
- Not necessary: scores on a test that does not satisfy content validity criteria may nonetheless be valid.

Criterion Validity

Criterion validity of test scores is established based on their empirical associations with other variables. The other variables are the criteria. For a self-report test designed to yield valid psychodiagnoses, correlations between test scores and actual diagnoses would provide information about criterion validity. For a test designed to identify potentially problematic employees, future job ratings would provide criterion validity information.

These two examples represent the two sub-types of criterion validity:
- Concurrent criterion validity: the criteria are measured at the same time the test is administered.
- Predictive criterion validity: the criteria are measured some time after the test is administered.

Prediction, in psychometric lingo, is not necessarily future-oriented; predictive validity, on the other hand, is future-oriented.

There are three major challenges in establishing test scores' criterion validity:
- Identifying appropriate criteria
- Measuring them adequately and appropriately
- Determining their appropriate association with test scores

Construct Validity

The absence of a gold standard is one of the primary challenges to evaluating criterion validity; in its absence, we do the best we can. This challenge led to the delineation of construct validity as an important psychometric feature of psychological test scores. The term "construct validity" was coined in the 1954 APA Technical Recommendations for Psychological Tests and Diagnostic Techniques.

Construct Validity

Construct validity, defined conceptually, is the extent to which test scores measure a theoretical construct or trait. Examples of theoretical constructs: scholastic aptitude, neuroticism, intelligence, leadership, depression. This definition is synonymous with the definition of validity; it is the research method and its underlying conceptualization that differ.

Theoretical foundations and research methods for elucidating test scores' construct validity were presented by Cronbach and Meehl (1955). Both were members of the APA Committee on Psychological Tests that produced the technical recommendations. Construct validity allows the test user to go beyond empirical correlates when interpreting test results.

Classification Accuracy Statistics

Often, in assessment, we are called upon to make dichotomous predictions:
- Schizophrenia vs. not schizophrenia
- Alcohol abuser vs. not alcohol abuser
- Genuine psychopathology vs. malingering
- Will act out violently vs. will not act out violently
- Will complete therapy vs. will not complete therapy
- Will attempt suicide vs. will not attempt suicide

Classification Prediction

However, the measures we use, and the constructs they assess, tend to be continuous:
- WAIS: intelligence
- CPI/MMPI/PAI: personality traits, behavioral proclivities, psychopathology

To make classification predictions, we establish cutoffs and create dichotomies (elevated versus non-elevated). To establish cutoffs and evaluate their validity, we collect data that allow for calculating classification accuracy statistics.

                          Problem Occurs
                          Yes     No
Test Score  Elevated      a       b
            Non-elevated  c       d

Desirable outcomes:
- a = elevated score correctly predicts that a problem will occur
- d = non-elevated score correctly predicts that a problem will not occur

Classification Accuracy Statistics

- Base Rate (BR): proportion of the sample that actually has the predicted condition
- Selection Ratio (SR): proportion of the sample that is predicted to have the condition
- Sensitivity (Sen): proportion of those who have the condition who are predicted to have the condition
- Specificity (Spec): proportion of those who do NOT have the condition who are predicted NOT to have the condition
- Positive Predictive Power (PPP): proportion of those predicted to have the condition who actually have the condition
- Negative Predictive Power (NPP): proportion of those predicted NOT to have the condition who actually do not have the condition
- True Positive Rate (TPR): proportion of those who have the condition who are predicted to have the condition (= Sensitivity)
- False Positive Rate (FPR): proportion of those who do NOT have the condition who are predicted to have the condition (= 1 − Specificity)
- Hit Rate (HR): proportion of correct predictions in the overall sample

Classification Accuracy Statistics

With cell counts a, b, c, d and n = a + b + c + d:

              Problem Occurs
              Yes      No
Elevated      a        b        a + b
Non-elevated  c        d        c + d
              a + c    b + d    n

BR = (a + c) / n        SR = (a + b) / n
Sen = a / (a + c)       Spec = d / (b + d)
PPP = a / (a + b)       NPP = d / (c + d)
TPR = a / (a + c)       FPR = 1 − d / (b + d) = b / (b + d)
HR = (a + d) / n

Classification Prediction

A psychologist has been asked by the FAA to construct an assessment procedure designed to identify pilots who abuse alcohol. The psychologist develops a test using large validation and cross-validation samples. The psychologist indicates that this is a very effective test; in fact, it identifies 95% of all pilots who abuse alcohol.

Classification Accuracy

                          Actual
                          Abuse     Not Abuse
Prediction  Abuse         475 (a)   4,525 (b)   5,000
            Not Abuse     25 (c)    4,975 (d)   5,000
                          500       9,500       10,000

Classification Accuracy Statistics

BR = (a + c) / n = 500 / 10,000 = .05
SR = (a + b) / n = 5,000 / 10,000 = .50
Sen = a / (a + c) = 475 / 500 = .95
TPR = a / (a + c) = 475 / 500 = .95
Spec = d / (b + d) = 4,975 / 9,500 = .52
FPR = 1 − (4,975 / 9,500) = .48
PPP = a / (a + b) = 475 / 5,000 = .095
NPP = d / (c + d) = 4,975 / 5,000 = .995
HR = (a + d) / n = (475 + 4,975) / 10,000 = .54

Using a higher classification cutoff will typically improve specificity at the cost of sensitivity. Lower specificity means more false positives (FPR = 1 − Spec). Which of these is emphasized depends in part on the nature of the assessment: for screening purposes (leading to more detailed assessment), sensitivity is key to not missing actual cases, and the more detailed assessment will (hopefully) further reduce false positives.

Classification Prediction

Low base rate phenomena present a challenge for positive predictive power. Why? Range restriction! Implication: some test classification accuracy characteristics established with one population will not apply to another population with a different BR.
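All of the statistics above follow from the four cell counts, so they are easy to compute; this sketch reproduces the FAA example (a = 475, b = 4,525, c = 25, d = 4,975):

```python
def classification_stats(a, b, c, d):
    """Classification accuracy statistics from a 2x2 table:
    a = true positives, b = false positives,
    c = false negatives, d = true negatives."""
    n = a + b + c + d
    return {
        "BR":   (a + c) / n,   # base rate
        "SR":   (a + b) / n,   # selection ratio
        "Sen":  a / (a + c),   # sensitivity (= TPR)
        "Spec": d / (b + d),   # specificity
        "PPP":  a / (a + b),   # positive predictive power
        "NPP":  d / (c + d),   # negative predictive power
        "FPR":  b / (b + d),   # false positive rate (= 1 - Spec)
        "HR":   (a + d) / n,   # hit rate
    }

for name, value in classification_stats(475, 4525, 25, 4975).items():
    print(f"{name}: {value:.3f}")
# BR 0.050, SR 0.500, Sen 0.950, Spec 0.524, PPP 0.095, NPP 0.995, FPR 0.476, HR 0.545
```

Re-running the same counts with an even lower base rate would show PPP falling further, which is the point of the base-rate caution above.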

27