Demonstrating validity

1 Demonstrating validity Nivja de Jong & Jelle Goeman

2 What is validity? Construct validity; criterion validity; face validity; content validity; consequential validity.

3 Validity, back to basics. Cattell, 1946; Kelley, 1927; Borsboom et al., 2004: whether an instrument actually measures what it sets out to measure. Borsboom et al. (2004): a test is valid for measuring an attribute if (a) the attribute exists and (b) variations in the attribute causally produce variation in the measurement outcomes.

4 What is validity? Whether an instrument actually measures what it sets out to measure. To demonstrate validity, we need theory that specifies the processes that bring about the causal effect between variations in the attribute and variation in the measurement outcome. Item difficulty should be theoretically predictable.

5 What does a valid test look like? Item difficulty should be theoretically predictable. Some form of unidimensionality of a test or sub-test is important because we want to summarize over all items. Purposeful construction of items: the sum-score should summarize.

6 How do we check validity? In practice: what do we do about validity? Post hoc relate item difficulty to item characteristics? Correlation between test scores and scores from other tests? Unidimensionality: Cronbach's alpha? Is this OK?

7 Correlation between test scores and scores from other tests? A new test of English language proficiency should be strongly related to our old (previously validated) test of English language proficiency. A new scale to measure weight should be strongly related to our old (previously validated) scale that measured weight.

8 r = .8? We are measuring the same construct?

9 How do we check validity? In practice: what do we do about validity? Post hoc relate item difficulty to item characteristics? Correlation between test scores and scores from other tests? Unidimensionality: Cronbach's alpha? Is this OK?

10 Cronbach's alpha: persisting confusion. Cronbach's alpha is intended to measure reliability (test-retest). Reliability is not the same as unidimensionality / internal consistency.

11 Cronbach's alpha: persisting confusion. Sijtsma, 2009: A single number, alpha, that expresses both reliability and internal consistency (conceived of as an aspect of validity that suggests that items "measure the same thing") is a blessing for the assessment of test quality. In the meantime, alpha "only" is a lower bound to the reliability and not even a realistic one.

12 Cronbach's alpha: persisting confusion. Cronbach's alpha is used as a measure of: test-retest reliability; agreement between judges; unidimensionality across items. In textbooks and handbooks: alpha > .x allows you to sum the scores. But is this true? Does alpha reveal anything about summability?

13 Simulation 1. We simulate subjects with either one skill or two unrelated skills. Each skill is tested with k items. We make a test with all k items (one skill) or all 2k items (two skills). Item score = ability + noise. Ability explains 50% of the variance of each item. k ranges from 2 to 50. We calculate Cronbach's alpha for one and for two skills within one test.
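The simulation described on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the sample size, the random seed, the standard-normal distributions, and the function names `cronbach_alpha` and `simulate_test` are all assumptions made here. Giving ability and noise equal variance is one way to make ability explain 50% of each item's variance.

```python
import numpy as np


def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_subjects, n_items) score matrix."""
    k = scores.shape[1]
    v = scores.var(axis=0, ddof=1).sum()   # sum of the item variances
    c = scores.sum(axis=1).var(ddof=1)     # variance of the sum score
    return k / (k - 1) * (1 - v / c)


def simulate_test(n_subjects, k, n_skills, rng):
    """Simulate a test with k items per skill (hypothetical helper).

    Item score = ability + noise, both with unit variance, so ability
    explains 50% of the variance of every item. Skills are independent.
    """
    abilities = rng.standard_normal((n_subjects, n_skills))
    blocks = [
        abilities[:, [s]] + rng.standard_normal((n_subjects, k))
        for s in range(n_skills)
    ]
    return np.hstack(blocks)


rng = np.random.default_rng(0)  # assumed seed, for reproducibility only
for k in (2, 5, 10, 25, 50):
    alpha_one = cronbach_alpha(simulate_test(2000, k, 1, rng))
    alpha_two = cronbach_alpha(simulate_test(2000, k, 2, rng))
    print(f"k={k:2d}  one skill: {alpha_one:.2f}  two skills: {alpha_two:.2f}")
```

Under these assumptions alpha climbs towards 1 as k grows in both conditions, even when the test mixes two unrelated skills.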

14 Cronbach's alpha for one or two constructs within one test

15 Cronbach's alpha is an excellent measure of test length!

16 The concept of summability. Unidimensionality? Alpha is not about unidimensionality but about reliability. Factor analysis is a good alternative, BUT do we actually look for unidimensionality? Example: a miscoded multiple-choice item will happily load onto the strongest first factor. Summability: we need the sum-score to summarize, not just any factor! Items are purposefully constructed.

17 Measuring summability. How much of the variance of the item scores is captured by the sum-score? Our definition of summability: the percentage of total item variance explained by the sum-score. Like an R² in regression. Comes in unadjusted and adjusted form.

18 Summability formula. Unadjusted: (formula not preserved). Adjusted: (formula not preserved). v is the sum of all item variances; c is the sum of all item variances and covariances; k is the number of items.
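The two formulas on this slide did not survive extraction. One reconstruction that is consistent with the slide's definitions of v, c and k, and with the stated idea of "variance explained by the sum-score", is given below. It is an assumption, not a quotation from the slide: it supposes that each item is predicted by the mean score S/k, where S is the sum score, and that the adjusted form rescales the unadjusted one so that uncorrelated items give 0.

```latex
% Residual item variance when every item X_i is predicted by S/k:
%   \sum_i \operatorname{Var}(X_i - S/k) = v - 2c/k + c/k = v - c/k
% so the explained part of the total item variance v is c/k.
\[
  s_{\text{unadjusted}} = \frac{c}{k\,v},
  \qquad
  s_{\text{adjusted}}
    = \frac{s_{\text{unadjusted}} - 1/k}{1 - 1/k}
    = \frac{c - v}{(k-1)\,v}.
\]
```

With uncorrelated items c = v, so the unadjusted value is 1/k and the adjusted value is 0. With perfectly correlated items of equal variance, c = kv and both forms equal 1.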

19 Simulation 2. Subjects with either one skill or two unrelated skills. Each skill is tested with k items. The test contains all k items (one skill) or all 2k items (two skills). Item score = ability + noise. Ability explains 50% of the variance of each item. k ranges from 2 to 50. We compare Cronbach's alpha with summability.
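A rough sketch of this comparison is given below. It is illustrative only: the adjusted summability is computed as (c - v) / ((k - 1) v), which is an assumed form consistent with the definitions of v, c and k on the previous slide, and the sample size, seed, and helper names are likewise assumptions.

```python
import numpy as np


def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_subjects, n_items) score matrix."""
    k = scores.shape[1]
    cov = np.cov(scores, rowvar=False)
    v, c = np.trace(cov), cov.sum()
    return k / (k - 1) * (1 - v / c)


def summability(scores, adjusted=True):
    """Summability under the ASSUMED formulas c/(k v) and (c - v)/((k - 1) v)."""
    k = scores.shape[1]
    cov = np.cov(scores, rowvar=False)
    v = np.trace(cov)   # sum of all item variances
    c = cov.sum()       # sum of all item variances and covariances
    if adjusted:
        return (c - v) / ((k - 1) * v)
    return c / (k * v)


def simulate_test(n_subjects, k, n_skills, rng):
    """k items per skill; item = ability + noise with equal variances."""
    abilities = rng.standard_normal((n_subjects, n_skills))
    blocks = [
        abilities[:, [s]] + rng.standard_normal((n_subjects, k))
        for s in range(n_skills)
    ]
    return np.hstack(blocks)


rng = np.random.default_rng(0)  # assumed seed
for n_skills in (1, 2):
    for k in (2, 10, 50):
        x = simulate_test(2000, k, n_skills, rng)
        print(
            f"skills={n_skills} k={k:2d} "
            f"alpha={cronbach_alpha(x):.2f} summability={summability(x):.2f}"
        )
```

Under these assumptions summability stays near .5 for one skill whatever the test length, and drops to roughly half of that for two unrelated skills, while alpha approaches 1 in both cases.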

20 Simulation 2: comparing Cronbach's alpha with summability for one construct

21 Simulation 2: comparing Cronbach's alpha with summability for two constructs

22 Recap: valid test characteristics. Item difficulty should be theoretically predictable. Some form of unidimensionality of a test or sub-test is important because we want to summarize over all items. Purposeful construction of items: the sum-score should summarize.

23 Variation between items. Item difficulty should relate to the theoretically grounded item characteristics. Example: in a multiple-choice reading comprehension test, texts differ in (linguistic) difficulty (sentence length, ...) and answer options differ in plausibility. In a valid MC reading test, the (linguistic) difficulty of the texts predicts item difficulty. This is easy to check with a correlation or regression analysis.
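The check mentioned on this slide might look like the sketch below. All data are synthetic and hypothetical: the item characteristic (mean sentence length per text), the number of items, and the assumed linear relation are invented purely to show the mechanics of a correlation and a simple regression.

```python
import numpy as np

rng = np.random.default_rng(1)  # assumed seed
n_items = 40                    # hypothetical number of items

# Hypothetical item characteristic: mean sentence length of each text.
sentence_length = rng.uniform(8, 30, size=n_items)

# Hypothetical item difficulty: proportion of correct answers per item,
# generated so that longer sentences make items harder, plus noise.
noise = rng.normal(0, 0.05, size=n_items)
p_correct = np.clip(0.95 - 0.02 * sentence_length + noise, 0, 1)

# Simple linear regression of item difficulty on the item characteristic.
slope, intercept = np.polyfit(sentence_length, p_correct, deg=1)

# For a single predictor, R squared is the squared Pearson correlation.
r = np.corrcoef(sentence_length, p_correct)[0, 1]
r_squared = r ** 2

print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r_squared:.2f}")
```

With real data, `sentence_length` would be replaced by the item characteristics identified beforehand and `p_correct` by the observed item difficulties.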

24 Application 1: productive vocabulary knowledge. 90 vocabulary items completed by 198 participants (binomial score). Item example: "Het was gisteren dinsdag, dus is het v______ woensdag." ("Yesterday was Tuesday, so t______ it is Wednesday"; the target word is presumably vandaag, "today"). Tested words taken from 10 frequency bands (rank ...). Is this a valid test of vocabulary knowledge? Can we sum the scores? Can we predict item difficulty in a theoretically grounded manner? De Jong et al., 2012; Hulstijn et al., 2012

25 Application 1: productive vocabulary knowledge. Summability of item scores: summability: .28 (Cronbach's alpha: .97). Item difficulty: calculated as number of correct answers / number of all answers. Related to (log) tested word frequency: R² = .45. De Jong et al., 2012; Hulstijn et al., 2012
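A short worked check shows how a summability of .28 and an alpha of .97 can coexist on a 90-item test. It assumes the adjusted summability is s = (c - v) / ((k - 1) v), with v the sum of item variances, c the sum of all item variances and covariances, and k the number of items. That form is a reconstruction, not taken from the slides, so the result is only a plausibility check.

```latex
% Standard alpha: \alpha = \frac{k}{k-1}\left(1 - \frac{v}{c}\right).
% Assumed summability: s = \frac{c - v}{(k-1)v}, hence c = v\,(1 + (k-1)s).
\[
  \alpha = \frac{k\,s}{1 + (k-1)\,s}
\]
% Application 1: k = 90, s = .28
\[
  \alpha = \frac{90 \times 0.28}{1 + 89 \times 0.28}
         = \frac{25.2}{25.92}
         \approx 0.97
\]
```

Under this assumption the reported pair of values is mutually consistent: alpha is high mainly because the test is long, not because the sum-score explains much of the item variance.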

26 Application 2: human ratings. 100 judges rating on 5 different aspects (5 groups of 20 judges) for 90 speech samples: 1. Fluency (pauses, speed, and repairs); 2. Pausing; 3. Speed; 4. Repairs; 5. Accentedness. Are these valid measures of fluency/accent? Can we sum the scores for each group of 20 judges? Can we predict item difficulty in a theoretically grounded manner? Bosker et al., 2013; Pinget et al., submitted

27 Application 2: human ratings. Summability of judge scores for group 1 (fluency): summability: .56 (Cronbach's alpha: .97). NB: collapsing over the 80 judges in groups 2 to 5 (pausing, speed, repairs, accentedness): summability: .25 (Cronbach's alpha: .97). "Item difficulty": calculated as the mean score over all judges. Related to a combination of objectively measured fluency characteristics of the speech samples: R² = .84. Bosker et al., 2013; Pinget et al., submitted

28 Discussion. On summability: in the end a (sub)test is reduced to a single sum score; validity is relevant for that sum score. Concept of summability: useful in language testing practice whenever scores are summarized with a single sum-score. Whether .28 and .56 are high or low: more experience is needed. On validity: purposeful construction of items: item characteristics that are identified beforehand must relate to post hoc item difficulty.

29 Questions?

In many fields, such as education or second language acquisition, Educational Measurement: Issues and Practice, xxxx 2018, Vol. 00, No. 0, pp. 1-10. How Well Does the Sum Score Summarize the Test? Summability as a Measure of Internal Consistency. J. J. Goeman and N. H.