Empowered by Psychometrics The Fundamentals of Psychometrics Jim Wollack University of Wisconsin Madison
Psycho-what? Psychometrics is the field of study concerned with the measurement of mental and psychological traits, abilities, and attitudes.
Purpose of Session Introduce several key psychometric concepts and gain an appreciation for the theoretical underpinnings of standardized tests. Scales, Norms, and Equating Validity Reliability Test Theory (as time permits) Classical Test Theory Item Response Theory
Poll Question Because of my comfort level with psychometrics, I am ______ to discuss measurement-related issues with faculty and students. A. eager B. willing C. hesitant D. unwilling
SECTION I Scales, Norms, and Equating
Measurement and Scaling Measurement is the process of assigning scores in a systematic and coherent way. For purposes of reporting, these scores are often transformed in some way to facilitate interpretations. Scaling is the process of constructing a score scale that associates numbers or other ordered indicators with the performance of examinees.
Why Scale? Imagine three students (A, B, and C) take a standardized test Student A answered 20 of 30 items correct. Student B answered 20 of 29 items correct. Student C answered 21 of 30 items correct. What can we say about the achievement level of these three students?
What can we say about A, B, and C? A: 20/30 B: 20/29 C: 21/30 Suppose you learned that the students completed different test forms? That the 3 hardest items were all on Test A? That the 12 easiest items were all on Test C? The overall difficulty levels of two different tests are rarely identical. Even if tests are of equal average difficulty, they may be differently difficult for students at different levels.
The Rationale Behind Scaling Raw scores (number correct scores) depend on the items on the test and do not have consistent meaning across forms. Same is true for percentage correct scores Makes score interpretations very difficult
Score Scales The score scale is the metric which is actually used for purposes of reporting scores to users. Moving from raw scores to the score scale involves either a linear or non-linear transformation The transformed scores themselves are called scaled scores or derived scores
Common Scales [figure: common score scales displayed against the normal curve]
Advantage of Measurement Scales Standardization The scale must not measure differently depending on what it is that's being measured Pieces, bites, handfuls, and number/percent correct Pounds, inches, ºF, level on construct Without a standardized reporting metric, direct comparisons are impossible
Transforming between two scales Blue test: Mean = 67, St. Dev = 8.5. Red test: Mean = 72, St. Dev = 5.7. [figure: two score distributions on a 35 to 100 axis]
Transforming between two scales Linear transformation, step 1: make the means equal. Add (72 − 67) = 5 to all blue scores.
Transforming between two scales After step 1: Blue Mean = 72, St. Dev = 8.5; Red Mean = 72, St. Dev = 5.7.
Transforming between two scales Linear transformation, step 2: make the st. devs equal. Multiply all blue deviations from the mean by (5.7 / 8.5).
Transforming between two scales After step 2: Blue Mean = 72, St. Dev = 5.7; Red Mean = 72, St. Dev = 5.7.
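The two-step linear transformation above can be sketched in code. This is a minimal illustration; the blue-test scores are hypothetical, and the targets (mean 72, SD 5.7) come from the slides.

```python
def linear_transform(scores, target_mean, target_sd):
    """Linearly rescale scores so they have the target mean and SD."""
    n = len(scores)
    mean = sum(scores) / n
    sd = (sum((x - mean) ** 2 for x in scores) / n) ** 0.5
    # Rescale deviations from the mean, then shift to the target mean,
    # combining the two steps shown on the slides.
    return [target_mean + (x - mean) * (target_sd / sd) for x in scores]

blue = [58, 62, 67, 72, 76]                  # hypothetical "blue test" scores
red_scale = linear_transform(blue, 72, 5.7)  # mapped onto the red test's metric
```

Because the transformation is linear, every examinee keeps the same relative standing; only the reporting metric changes.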
Any set of test scores can easily be transformed to some other metric. This allows for direct norm-referenced comparisons Candidate A scored 82 on the red test (96th percentile) Candidate B scored 82 on the blue test (96th percentile)
Transforming between two scales Both tests: Mean = 72, St. Dev = 5.7. Score = 82 falls at the 96th percentile. [figure: two matched distributions on a 35 to 100 axis]
Transforming between two scales Are these two students comparable? Score = 82, 96th percentile on each test.
Poll Question Are these two students comparable? A. Yes B. No C. It cannot be determined
Are the two students comparable? We don't know. Students' scaled scores and percentile ranks are relative to other students who completed that same form. If the populations of test takers were different, it is quite likely that the examinees are not of equal ability.
SAT GRE GED Does SAT = 600 mean the same as GRE = 600?
Norming Perform initial scaling on a single (base) test form (calibration) The sample completing the base form should be large and representative of the target population The sample taking the base form is used as a reference point for purposes of comparison with all subsequent samples
Norm Group The norm group is the group of individuals for whom the test scale was established SAT is scaled to have an average of 500 the average of the norm group was 500
Test Equating Need to transform the data so that the scores candidates receive are the same scores they would have received if they Were part of the normative sample, and Had been administered the base form This process, known as equating, ensures that test scores have identical meaning across administrations, even as items and populations change.
Equating The process of determining the transformation to convert between the raw score metric and the reporting metric (based on the norm group) Equating is a topic worthy of a full-length graduate-level course Requires comparison across common elements between base form and new form : Common items Assumed randomly equivalent populations Very difficult assumption to make across years
Simple Equating Design
Base-form administration: Base Test Average 500 (Std. Dev 100); Common Items Average 510 (Std. Dev 98)
New-form administration: Common Items Average 495 (Std. Dev 105); New Test Average 500 (Std. Dev 100)
After equating: New Test Average 515 (Std. Dev 93)
In much the same way we did before, we can now align these two assessments using only data from the common items:
1. Add (510 − 495) = 15 to all New Test scores
2. Multiply all New Test deviations from the mean by 98 / 105 = 0.93
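The common-item adjustment above can be sketched as a simple function. This is a simplified illustration using the slide's numbers; operational equating designs are considerably more elaborate.

```python
def equate_score(raw, new_form_mean, shift, slope):
    """Put a raw score from the new form onto the base-form scale:
    shift the mean, then rescale deviations so the SDs match."""
    return (new_form_mean + shift) + (raw - new_form_mean) * slope

# Numbers from the slide's example:
shift = 510 - 495   # difference in common-item averages
slope = 98 / 105    # ratio of common-item standard deviations
equated_mean = equate_score(500, 500, shift, slope)  # -> 515.0
```

A candidate who scored exactly at the new form's mean (500) receives an equated score of 515, and the spread of equated scores shrinks to 100 × 0.93 = 93, matching the slide's "After Equating" column.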
SECTION I Scales, Norms, and Equating Questions?
SECTION II Validity
Definitions of Validity Formal Definition of Validity: Degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions based on test scores. Less Formal Definition of Validity: Degree to which the inferences made from a test (e.g., that the student knows the material of interest, is competent to practice in discipline, is likely to be successful in training/school, etc.) are justified and accurate. Informal Definition of Validity: A test is valid if it measures what it's supposed to measure
Problem with the Informal Definition Informal Definition of Validity: A test is valid if it measures what it's supposed to measure Tests cannot be valid Tests are valid only for specific purposes Placement/Admissions exams for exemption VSA Testing (differences in target population) Validity is a matter of degree
Measuring Educ/Psychological Variables Unlike physical attributes (height, weight, etc.), educational and psychological attributes cannot be directly observed, hence cannot be directly measured. To the extent that a variable is abstract and latent rather than concrete and observable, it is called a construct. In order to move from construct formation to measurement, the construct which is to be measured must first be operationally defined.
Embodiment of a Construct Operationalize What behaviors comprise the construct? How does this construct relate to or distinguish itself from other constructs? Plan How will samples of these behaviors be obtained? Instrument development Develop a standard procedure for obtaining the sample of behaviors from the specified domain. Measurement
Test Validation The degree to which the evidence supports the claim that the assessment measures the intended construct. Three types of validity? Construct Validity Criterion-Related Validity Content Validity No, just one. It's really all about Construct Validity
Assessing Construct Validity Target group differences Is there a logical differentiation between groups? e.g., placement test math scores for students who completed HS Calculus vs. those who only completed HS Algebra Correlational studies between test and related (or unrelated) measures Convergent validity: Does test correlate with other measures that are theoretically related? ACT and SAT, Compass & Accuplacer, Different IQ tests Divergent validity: Does test fail to correlate with other measures that are theoretically unrelated? ACT and Stanford Binet Math and English placement scores
Assessing Construct Validity Factor Analysis Statistical procedure to empirically test whether performance on observed variables (items) can be explained by a smaller number of unobserved constructs. Dimensionality assessment Does empirical structure match theoretical structure? Can also assess whether clusters of items are related in ways that are expected. Are items that are intended to measure the same subscores (e.g., trigonometry, algebra, etc.) more similar to each other than to other items?
Assessing Construct Validity Content Validity The extent to which the set of items on the test are representative of and relevant to the construct Items should cover the breadth and depth of the construct Weight assigned to each content area should reflect importance of that content area within construct For employment and certification exams, often necessary to conduct a practice analysis Panels of content experts are often utilized to assess relevance of items
Assessing Construct Validity Criterion-Related Validity Examines the relationship of the test results to other variables/criteria external to the test Predictive The extent to which an individual s future level on the criterion can be predicted from prior test performance Correlation between ACT/SAT scores and first year GPA Concurrent The extent to which test scores estimate an individual s present standing on the criterion. Correlation between Prior Learning Assessment and final course grade
SECTION II Validity Questions?
SECTION III Reliability
A Game of Darts Validity: Confidence that the test will hit the bullseye Reliability: Confidence that any one dart is a good predictor of where the next dart would go, i.e., clustering the darts together
Unreliable
Reliable, but not valid
Reasonably reliable and valid
Highly reliable and valid
Reliability and Validity A test cannot be valid (for any purpose) unless it is reliable. Validity: Confidence that the test will hit the bullseye Not that it will average out to the bullseye
Working Definitions of Reliability The degree to which a test is consistent and stable in measuring what it is intended to measure Measurement repeatability Will an examinee score similarly when administered an independent alternate form of the test under the same conditions and with no opportunity for learning (or forgetting)?
Understanding Reliability No two tests will consistently produce identical results. All test scores contain some random error Observed Score = True Score + Random Error = Signal + Noise This equation is often written as X = T + E
What is Random Error? Any non-systematic source of variance that is unrelated to the construct of interest. Examinee-specific factors Motivation Concentration Fatigue Boredom Test-specific factors Specific questions Ambiguous items Memory lapses Carelessness Luck in guessing Clarity of directions Reading load of items Scoring-specific factors Non-uniform scoring Carelessness Computational errors
Formal Definition of Reliability X = T + E A measure of the extent to which an examinee's score reflects their true score (as opposed to random measurement error) Reliability = Variance(True) / Variance(Observed) = 1 − Variance(Error) / Variance(Observed) A test with reliability of .80 contains 20% random error
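The variance-ratio definition can be verified with a small simulation of X = T + E. The true-score and error SDs below (2 and 1) are hypothetical choices, implying a theoretical reliability of 4 / (4 + 1) = .80.

```python
import random

random.seed(1)
n = 100_000
true = [random.gauss(0, 2.0) for _ in range(n)]   # true-score SD = 2
error = [random.gauss(0, 1.0) for _ in range(n)]  # random error, SD = 1
observed = [t + e for t, e in zip(true, error)]   # X = T + E

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Reliability as the ratio of true-score variance to observed variance
reliability = var(true) / var(observed)
```

With 100,000 simulated examinees the estimate lands very close to the theoretical .80: 80% of observed-score variance is signal, 20% is noise.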
Reliability and the SEM If reliability is a measure of the stability of measurement, the standard error of measurement (SEM) provides a measure of the instability of measurement. SEM = (st. dev.) × √(1 − reliability) Provides a measure of the expected variability in an individual's score (X i ) upon retesting. Score Interval / Probability of score falling in interval: X i ± 1 SEM: 68%; X i ± 2 SEM: 95%
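A minimal sketch of the SEM and the retest intervals it implies; the SD of 100, reliability of .91, and observed score of 520 are hypothetical.

```python
def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * (1 - reliability) ** 0.5

s = sem(100, 0.91)                    # -> 30.0 for these hypothetical values
low68, high68 = 520 - s, 520 + s      # ~68% interval around a score of 520
low95, high95 = 520 - 2 * s, 520 + 2 * s  # ~95% interval
```

Note how quickly the interval widens as reliability drops: at reliability .75 the same SD gives an SEM of 50, so the 95% band spans 200 score points.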
Why Care About Reliability? Measurement error is random its effect on a student s test score is unpredictable. In an unreliable test, students scores consist largely of measurement error. An unreliable test offers no advantage over randomly assigning test scores to students. Reliability is a necessary precursor to validity
Estimating Reliability Test-Retest Reliability Administer the same exam to the same group of candidates and correlate the scores Interval should be short enough for no learning, and long enough for no remembering Parallel/Alternate Forms Reliability 1. Develop equivalent forms of the test. 2. Have examinees take both tests. 3. Correlate the scores.
Estimating Reliability: 1 Administration Split-Half Reliability 1. Split exam in two random halves 2. Correlate scores across the two halves. 3. Apply a formula to estimate reliability Internal Consistency Cronbach s Coefficient α, KR-20, KR-21 Measures of the extent to which the test items throughout a test are homogeneous α is average split-half reliability across all possible split-halves. α and KR-20 are lower-bound estimates of reliability
Reliability In Practice.90 1.00 High Stakes Standardized Testing.80.90 Subscores or Low-stakes tests.70.80.60.70.50.60.40.50.30.40.20.30.10.20.00.10
Improving Reliability Improve item quality Increase the number of points or item alternatives Increase the number of items
SECTION III Reliability Questions? End?
SECTION IV Test Theory Classical Test Theory Item Response Theory
Classical Test Theory X = T + E Person characteristics Total test score serves as a proxy for examinee's level on the construct Item characteristics Item difficulty is estimated as the proportion of examinees who answer an item correctly Item discrimination measures how effectively the item differentiates between high- and low-performing examinees. Correlation between item score (1/0) and total score
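The two CTT item statistics just described, the p-value (difficulty) and the item-total correlation (discrimination), can be sketched as below; the response matrix is hypothetical.

```python
def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def item_stats(responses):
    """responses: one row per examinee; 1 = correct, 0 = incorrect.
    Returns (difficulty p-value, item-total correlation) for each item."""
    n, k = len(responses), len(responses[0])
    totals = [sum(row) for row in responses]
    out = []
    for i in range(k):
        item = [row[i] for row in responses]
        out.append((sum(item) / n, pearson(item, totals)))
    return out

data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]  # hypothetical responses
stats = item_stats(data)
```

Note the sample-dependency the slide warns about: rerun this on a more able group and every p-value rises, even though the items themselves have not changed.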
Item Response Theory Mathematical modeling approach to test scoring and analysis Less intuitive, but more sophisticated approach Solves many problems with CTT Sample-dependency of item/exam statistics Test-dependency of total scores Tough to compare people and items Equal item weighting No good way to account for guessing
Trait Level vs. Prob. Correct Response [figure: probability of a correct response to Item 1 plotted against examinee trait level θ, from −3.0 to 3.0]
An Item Characteristic Curve [figure: S-shaped curve relating θ to the probability of a correct response]
Sample Independent: Same Curve [figure: the same item characteristic curve is obtained regardless of the sample]
Item Response Theory Directly models the probability of a candidate getting an item correct based on their overall level on the construct and item characteristics: P i (θ) = c i + (1 − c i ) / (1 + e^(−a i (θ − b i ))) θ is the person's level on the construct a i, b i, and c i are item parameters corresponding to the item's discrimination, difficulty, and guessing likelihood
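The three-parameter logistic (3PL) model described above is a one-liner in code; the parameter values passed below are hypothetical.

```python
import math

def p_correct(theta, a, b, c):
    """3PL model: probability of a correct response at trait level theta,
    for an item with discrimination a, difficulty b, and guessing c."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

p_correct(0.0, 1.0, 0.0, 0.0)  # -> 0.5: at theta = b, with no guessing,
                               #    the item is a coin flip
```

The guessing parameter c sets a floor on the curve: even an examinee far below the item's difficulty retains at least that probability of answering correctly.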
Item Difficulty [figure: an item characteristic curve reaching probability .50 at θ = 1.0, the item's difficulty]
Item Difficulty [figure: two curves of different difficulty; at the same trait level the probabilities of success are .68 and .41]
Item Discrimination [figure: two curves of different steepness; over the same θ interval, the flatter curve rises .59 − .41 = .18 while the steeper curve rises .68 − .32 = .36]
Accounting for Guessing [figure: a curve with a nonzero lower asymptote, reflecting the probability of guessing correctly]
Putting it all Together [figure: item characteristic curves varying simultaneously in difficulty, discrimination, and guessing]
Test Characteristic Curve (TCC) Describes relationship between total test score and examinee trait level (θ) TCC is obtained by adding item characteristic curves across all values of θ Each test has its own TCC
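Summing ICCs into a TCC is direct to sketch; the three item parameter triples below are hypothetical, and the ICC uses the standard 3PL form.

```python
import math

def p_correct(theta, a, b, c):
    """3PL item characteristic curve."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    """Test characteristic curve: expected total score at trait level theta
    is the sum of the item characteristic curves."""
    return sum(p_correct(theta, a, b, c) for a, b, c in items)

# Hypothetical (a, b, c) parameters for a 3-item test
items = [(1.2, -1.0, 0.2), (1.0, 0.0, 0.2), (0.8, 1.0, 0.2)]
expected_score = tcc(0.0, items)
```

Because each ICC is increasing in θ, the TCC is too; swapping in an easier item set shifts the whole curve left, exactly the behavior described on the next slides.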
PT Test Characteristic Curve A form with slightly easier items will shift the TCC to the left, requiring the examinee to answer a greater number of items correctly in order to pass
PT Test Characteristic Curve A form with slightly harder items will shift the TCC to the right, requiring the examinee to answer a smaller number of items correctly in order to pass
3 Hypothetical TCCs [figure: projected test score (0 to 200) plotted against θ from −3 to 3 for three forms: Easier (top), Anchor (middle), Harder (bottom)] IRT is also independent of characteristics of the specific test form
IRT Summary Although dealing with raw scores is conceptually appealing, it is problematic in practice IRT overcomes many of these problems IRT difficulty and person trait estimates are scaled together Item and person parameters are properties of the items and people, and do not change across samples or test forms. Majority of programs use IRT scoring and linearly transform θ to scale of interest
SECTION IV Test Theory Classical Test Theory Item Response Theory Questions?
Thank you For more information, please contact Jim Wollack University of Wisconsin Madison jwollack@wisc.edu