Raters, rating and reliability

1 Raters, rating and reliability

2 [Diagram: validation framework with the components Test Taker; Context; Cognitive; Response; Scoring Validity; Score/Grade; Consequential Validity; Criterion-Related Validity]

3 How far can we depend on the scores which result from the test? Our focus here is on scoring validity.

4 Scoring validity is a superordinate term for all aspects of reliability (see Chapter 6 in Weigle 2002 and Chapter 5 in Shaw and Weir 2007). It accounts for the extent to which test scores: are based on appropriate criteria; exhibit consensual agreement in marking; are as free as possible from measurement error; are stable over time; are consistent in terms of their content sampling; inspire confidence as reliable decision-making indicators.

5 Scoring validity is linked directly to both cognitive and context validity: the test construct as a triangular relationship. This is an interactionalist position, which sees the (writing) construct as residing in the interactions between the underlying cognitive ability, the context of use and the process of scoring.

6 Rating criteria / rating scale; rating procedures; rater selection; rater training; standardisation; rating conditions; moderation; statistical analysis; raters; grading and awarding

7 Writing task design; assessment criteria; validity; authenticity; reliability; impact; practicality

8 Validity: traditionally the most important examination quality; concerns the appropriateness and meaningfulness of an exam in a specific educational context and the specific inferences made from exam results. Task authenticity: an important aspect of validity. Reliability: contributes to overall validity; concerns the extent to which test results are stable, consistent, and free from bias and random error.

9 Reliability is concerned with minimizing the effects of measurement error, while validity is concerned with maximizing the effects of the language abilities we want to measure (Saville 2003:69). There is a potential tension between validity and reliability in performance assessment.

10 Validity: direct testing; variety of task types; more functions tested; fewer inferences required; administration complex; more positive impact on teaching and learning. Reliability: indirect testing; fewer task types; fewer functions tested; more inferences required; administration simpler; less positive impact on teaching and learning.

11 High reliability can be achieved by narrowing the range of task types or the range of skills tested. However, this restricts the interpretations that can be placed on performance in the test, and hence its validity. The key, therefore, is to balance the potential tension between reliability and validity.

12 Task standardisation: specification; trialling; format; timing; length

13 Specification of the content of the assessment; using pooled judgements to select content; requiring multiple judgements; adopting standard procedures; basing judgements on specific defined criteria; undertaking appropriate training; checking validity and reliability by analysing assessment data

14 More test tasks; more raters
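
The slide does not say how much extra tasks or raters help. One common way to get a rough feel for it is the Spearman-Brown prophecy formula, which projects the reliability of a score pooled over k parallel measurements from the reliability of a single one. The sketch below is illustrative only: the function name and the starting reliability of 0.6 are assumptions, not figures from the slides, and the formula itself assumes that the added raters or tasks are interchangeable with the existing ones.

```python
def spearman_brown(single_reliability: float, k: float) -> float:
    """Projected reliability when the number of raters (or tasks) is multiplied by k.

    Assumes the added measurements are parallel to the original one.
    """
    if not 0.0 <= single_reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    if k <= 0:
        raise ValueError("k must be positive")
    return k * single_reliability / (1 + (k - 1) * single_reliability)


# Hypothetical starting point: a single rating with reliability 0.6.
for k in (1, 2, 3, 4):
    print(k, round(spearman_brown(0.6, k), 3))
# 1 0.6
# 2 0.75
# 3 0.818
# 4 0.857
```

The gains shrink with each added rater or task, which is one reason the choice is usually weighed against practicality.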

15 What do you think makes a good rater in terms of: knowledge? skills? qualifications? background? experience? What are the minimum professional requirements?

16 What sort of training will the writing examiner need for their role? Familiarisation with test format and procedure; familiarisation with assessment criteria; initial induction and training, followed by ongoing standardisation/coordination. How can this be achieved? Face-to-face; semi-direct; online.

17 Let's look at an example of what one test provider does.

18 Recruitment; induction; training; evaluation; coordination; monitoring

19 Managing the examiner community: Cambridge ESOL; Team Leaders; Writing Examiners

20 The importance of rater training and standardisation. Why? To reduce rater biases: leniency; harshness; halo effect (different types of halo effect); limited use of the scale.
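
Two of the biases listed above, leniency/harshness and limited use of the scale, can be screened for with very simple descriptive checks. The sketch below is a minimal illustration under stated assumptions: the rater names, the scores and the six-point (0 to 5) scale are invented, and `rater_summary` is a hypothetical helper rather than part of any test provider's procedure. It does not attempt to detect halo effects.

```python
from statistics import mean

# Hypothetical data: three raters score the same six scripts on a 0-5 scale.
scores = {
    "rater_a": [3, 4, 2, 5, 3, 4],
    "rater_b": [4, 5, 3, 5, 4, 5],
    "rater_c": [3, 3, 3, 3, 3, 4],
}


def rater_summary(scores, scale_points):
    """Return each rater's mean deviation from the pooled script means
    (positive = lenient, negative = harsh) and how much of the scale they used."""
    n_scripts = len(next(iter(scores.values())))
    # Pooled mean score for each script across all raters.
    script_means = [mean(r[i] for r in scores.values()) for i in range(n_scripts)]
    summary = {}
    for rater, given in scores.items():
        deviation = mean(g - m for g, m in zip(given, script_means))
        used = len(set(given))
        summary[rater] = {
            "mean_deviation": deviation,
            "scale_points_used": used,
            "share_of_scale_used": used / scale_points,
        }
    return summary


summary = rater_summary(scores, scale_points=6)
for rater, stats in summary.items():
    print(rater, round(stats["mean_deviation"], 2), stats["scale_points_used"])
# rater_a -0.17 4
# rater_b 0.67 3
# rater_c -0.5 2
```

In this invented data set rater_b looks lenient relative to the pool, and rater_c uses only two scale points. With so few scripts, such figures would only prompt a closer look, not a judgement about the rater.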

21 A system for routinely monitoring and evaluating the performance of examiners; giving feedback to examiners, leading to possible follow-up action

22 To investigate and confirm the quality of raters' scoring behaviour. Classical analyses: correlation coefficients; % levels of agreement. Rasch-based analyses: FACETS program; inter/intra-rater reliability estimates, rating scale analysis, task analysis; possible scaling of examiners before awarding scores/grades to correct for leniency/harshness.
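
The two classical analyses named above can be sketched in a few lines. This is a minimal illustration, assuming two raters have scored the same eight scripts on the same scale; the scores and the function names are invented for the example. Raw percentage agreement is not corrected for chance agreement, and the Rasch-based analyses (FACETS) are not covered here.

```python
from math import sqrt


def percent_agreement(a, b, tolerance=0):
    """Percentage of scripts on which two raters differ by at most `tolerance` points.

    tolerance=0 gives exact agreement; tolerance=1 gives adjacent agreement.
    """
    if len(a) != len(b) or not a:
        raise ValueError("both raters must score the same non-empty set of scripts")
    hits = sum(1 for x, y in zip(a, b) if abs(x - y) <= tolerance)
    return 100.0 * hits / len(a)


def pearson_r(a, b):
    """Pearson correlation between two raters' scores (inter-rater consistency)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Hypothetical scores from two raters on the same eight scripts.
rater_1 = [3, 4, 2, 5, 3, 4, 1, 3]
rater_2 = [3, 5, 2, 4, 3, 4, 2, 2]

print(percent_agreement(rater_1, rater_2))               # 50.0 (exact)
print(percent_agreement(rater_1, rater_2, tolerance=1))  # 100.0 (exact or adjacent)
print(round(pearson_r(rater_1, rater_2), 2))
```

The two measures answer different questions: a pair of raters can correlate highly while one is consistently a point harsher, which is why agreement levels and correlations are usually reported together.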

23 [Diagram: validation framework with the components Test Taker; Context; Cognitive; Response; Scoring Validity; Score/Grade; Consequential Validity; Criterion-Related Validity]
