
CHAPTER 4
UTAGS Reliability

Test scores are composed of two sources of variation: reliable variance and error variance. Reliable variance is the proportion of a test score that is true or consistent, while error variance is the proportion of a test score that is random and not associated with the construct assessed. Only reliable variance contributes to our understanding of an examinee's performance on a test; error variance does not predict outcomes, does not correlate with other measures, and adds uncertainty about the meaningfulness of the examinee's test score. Because the portion of a test score that is not reliable is error, it is important for test developers to maximize a test's reliability so as to minimize error and maximize the confidence one can place in the resulting test scores.

The study of test reliability centers on estimating the degree of variance that is known to be true or reliable, as opposed to the error associated with scores. When reliability is investigated, results are usually reported in terms of a coefficient that is a specific use of the common correlation coefficient and ranges from 0 (100% error) to 1.0 (i.e., perfect reliability). For instruments such as the UTAGS to be considered minimally reliable, their reliability coefficients must approximate or exceed .80 in magnitude (Bracken, 1987); coefficients of .90 or higher are considered the most desirable and most appropriate for tests designed for placement decisions (Aiken & Groth-Marnat, 2006; Bracken, 1987; Miller, Linn, & Gronlund, 2009; Reynolds, Livingston, & Willson, 2009; Salvia, Ysseldyke, & Bolt, 2010; Weiner & Craighead, 2010). Anastasi and Urbina (1997) described three contributors to variance associated with reliability: content (i.e., internal consistency), time (stability), and scorers (interscorer).
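The split between reliable variance and error variance described above can be sketched with a short simulation (hypothetical values, not UTAGS data): observed scores are generated as true score plus random error, and the reliability coefficient is recovered as the ratio of true-score variance to observed-score variance.

```python
import random

random.seed(42)

# Hypothetical illustration: observed score = true score + random error.
# Reliability is the proportion of observed-score variance that is true variance.
N = 100_000
true_sd, error_sd = 13.0, 5.0  # implies reliability of 169 / 194, about .87

true = [random.gauss(100, true_sd) for _ in range(N)]
observed = [t + random.gauss(0, error_sd) for t in true]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

reliability = variance(true) / variance(observed)
print(f"estimated reliability: {reliability:.2f}")
```

With a large simulated sample the estimate lands close to the theoretical value, σ²_true / (σ²_true + σ²_error); shrinking the error SD pushes the coefficient toward 1.0, the "perfect reliability" end of the range discussed above.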
In our investigation of the UTAGS's reliability, we calculated three types of coefficients: coefficient alpha for internal consistency, test-retest correlation coefficients for stability, and interrater correlations to assess scorer consistency. Each of these types of reliability provides evidence relating to Anastasi and Urbina's (1997) sources of test error (i.e., content sampling error, content heterogeneity error, time sampling error, and scoring error).

INTERNAL CONSISTENCY

Coefficient alpha estimates internal consistency reliability and demonstrates the extent to which test items correlate with the total scale or subscale score. Alpha estimates the amount of test-item consistency associated with content sampling and content homogeneity.

UNIVERSAL TALENTED AND GIFTED SCREENER ADMINISTRATION MANUAL

Coefficient alphas for the UTAGS subscales were computed using Cronbach's (1951) method; coefficient alphas for the composite were computed using Guilford's (1954, p. 393) formula designed for this purpose. Coefficient alphas for the UTAGS are reported in Table 6. Because the UTAGS had no significant age effects and the normative tables are based on the entire normative sample, alphas were also computed on the entire normative sample.

Table 6 shows that the UTAGS is exceptionally reliable. For instruments used for diagnosis and placement decisions, the criterion for acceptable reliability is .90 (Bracken, 1987). As can be seen, the coefficient alphas exceed this criterion and provide strong support for the reliability of the UTAGS. Thus, examiners may place confidence in any of the subscales when interpreting UTAGS results and making diagnostic decisions.

The standard error of measurement (SEM) associated with coefficient alpha is an important statistic for examiners to use when interpreting scores. The SEM is the standard deviation of the error distribution associated with test scores, and attention to its meaning is one way for an examiner to take into consideration the error variance that enters into the assessment situation. A test score is only an estimate of the student's true test performance; the extent to which that score is accurate is due to reliability, and the extent to which it is inaccurate is due to test error. The SEM is based on the formula SEM = SD × √(1 − r) (where SD = standard deviation and r = reliability) and establishes a zone within which an individual's true score probably lies. Table 6 provides the SEMs at three levels of confidence (i.e., 68%, 90%, and 95%) for all UTAGS scores.

The clinical value of the SEM can be illustrated by the student's scores reported in the case study in Chapter 2. Demarcus earned a General Aptitude Index score of 128.
Thus, the examiner knows with 68% confidence (i.e., ±1 SEM) that his true score lies between 126.50 and 129.50 (128 ± 1.50 points), with 90% confidence that his true score lies between 126.08 and 129.92 (± 1.92 points), and with 95% confidence that his true score lies between 125.54 and 130.46 (± 2.46 points). Obviously, the smaller the SEM, the more confidence one can have in the accuracy of the obtained test score. For practical purposes, obtained scores are rounded.

Because tests that are reliable for the general population may not be equally reliable for every subgroup within that population, evidence of test reliability must be provided for specific subgroups. The alphas for 12 selected subgroups within the normative sample are reported in Table 7. All of the alphas reported in this table exceed .90. These consistently high alphas demonstrate that the UTAGS is equally reliable for all subgroups investigated and support the premise that the test is equally accurate regardless of the examinee's gender, race, ethnicity, and exceptionality status.

STABILITY

Test-retest estimates of stability reflect the extent to which a student's test performance is consistent over time and are used to estimate time sampling error in a test. This reliability approach

involves administering the test and then readministering it, generally 2 to 4 weeks later. The degree of similarity between the two test scores over time indicates the degree of stability of the construct assessed by the test.

Table 6
Coefficient Alpha and Standard Error of Measurement for UTAGS

UTAGS Score         Alpha   68% CI for SEM   90% CI for SEM   95% CI for SEM
Cognition            .98         2.12             2.71             3.48
Creativity           .98         2.12             2.71             3.48
Leadership           .98         2.12             2.71             3.48
Literacy             .98         2.12             2.71             3.48
Math                 .99         1.50             1.92             2.46
Science              .98         2.12             2.71             3.48
General Aptitude     .99         1.50             1.92             2.46

Note. CI = confidence interval.

This kind of reliability was investigated for the UTAGS by having teachers rate 80 students between the ages of 5 and 17 years who were attending general public school classes in Tucson, AZ; Knoxville, TN; and Williamsburg, VA. Table 8 presents specific demographic information about the test-retest sample. After the testing was completed, standard scores were computed for each administration, and the scores from the two administrations were correlated. All coefficients were corrected for possible restriction or inflation of range. The resulting corrected coefficients, presented in Table 9, range from .84 to .96, demonstrating solid stability of UTAGS results over time.

SCORER CONSISTENCY

Reliability among scorers of objective tests is understandably high; in such instances, the only possible type of error is clerical. Scorer error can be reduced considerably by clear administration procedures and detailed guidelines that govern scoring. Nevertheless, test authors should demonstrate statistically the amount of error in their test that is due to different scorers. To achieve this goal, authorities such as Anastasi and Urbina (1997) and Reynolds et al. (2009) recommended that two or more individuals score a set of tests independently. The mean correlation among the scorers yields an index of agreement.
In the case of the UTAGS, two members of the PRO-ED staff who were familiar with the test's scoring procedures independently scored 31 complete UTAGS protocols drawn at random from the normative sample. The results of the two scorings were then correlated and corrected for restriction of range. The resulting coefficients were all .99, further evidence of the UTAGS's suitable scorer reliability.
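The chapter reports that coefficients were "corrected for restriction of range" but does not reproduce the correction formula. A common choice for direct range restriction, and only an assumption here, is Thorndike's Case 2 correction; a minimal sketch:

```python
import math

def correct_for_range_restriction(r: float, sample_sd: float, pop_sd: float) -> float:
    """Thorndike Case 2 correction for direct range restriction.

    r          -- correlation observed in the restricted (or inflated) sample
    sample_sd  -- SD of the selection variable in the sample
    pop_sd     -- SD in the reference population (e.g., 15 for standard scores)
    """
    k = pop_sd / sample_sd
    return (r * k) / math.sqrt(1 - r * r + (r * k) ** 2)

# Hypothetical example (not the UTAGS data): an observed r of .985 in a
# sample whose SD (14) is slightly narrower than the normative SD (15).
print(round(correct_for_range_restriction(0.985, sample_sd=14, pop_sd=15), 3))  # → 0.987
```

When the sample SD equals the population SD the correction leaves r unchanged; a restricted sample (smaller SD) yields a corrected coefficient somewhat larger than the observed one, which is why the manual notes both "restriction or inflation" of range.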

Table 7
Coefficients Alpha for Selected Subgroups on the UTAGS (Decimals Omitted)

Subgroup                                n    Cog  Cre  Lea  Lit  Math  Sci  GA
Male                                 1,283    98   97   98   98   98    98  99
Female                               1,209    98   98   98   98   98    98  99
White                                1,993    98   98   98   98   98    98  99
Black/African American                 286    99   98   97   98   99    99  99
Asian/Pacific Islander                 122    97   93   96   96   97    97  99
American Indian/Eskimo/Aleut            31    99   98   98   98   99    98  99
Hispanic                               413    99   99   98   98   99    99  99
Other                                   42    99   98   98   99   99    99  99
Gifted and Talented                    163    98   97   97   97   98    98  99
Intellectually Gifted (IQ > 120)        27    93   95   95   95   95    95  99
Learning Disabled                       56    99   98   98   99   99    99  99
Intellectually Disabled (IQ < 80)       25    98   98   97   97   99    96  99

Note. Cog = Cognition; Cre = Creativity; Lea = Leadership; Lit = Literacy; Sci = Science; GA = General Aptitude (composite).
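The alphas in Tables 6 and 7 were obtained with Cronbach's (1951) method. The UTAGS item-level data are not reproduced in this chapter, so the sketch below simply applies the standard formula, α = k/(k − 1) × (1 − Σs²ᵢ / s²_total), to made-up rating data.

```python
def cronbach_alpha(items):
    """Cronbach's (1951) coefficient alpha.

    items[i][j] = score of respondent j on item i.
    """
    k = len(items)          # number of items
    n = len(items[0])       # number of respondents

    def var(xs):  # sample variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    totals = [sum(item[j] for item in items) for j in range(n)]
    item_var_sum = sum(var(item) for item in items)
    return k / (k - 1) * (1 - item_var_sum / var(totals))

# Hypothetical 4-item rating scale, 6 respondents (not UTAGS data).
ratings = [
    [3, 4, 5, 2, 4, 5],
    [3, 5, 5, 1, 4, 4],
    [2, 4, 4, 2, 3, 5],
    [3, 5, 5, 2, 4, 4],
]
print(f"alpha = {cronbach_alpha(ratings):.2f}")  # → alpha = 0.95
```

Items that rank respondents consistently, as in this toy example, drive the total-score variance well above the sum of the item variances, which is what pushes alpha toward the high values reported for the UTAGS.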

Table 8
Demographic Characteristics of the UTAGS Test-Retest Sample (N = 80)

Characteristic                  n
Gender
  Male                         48
  Female                       32
Ethnicity
  White                        62
  African American             10
  Asian/Pacific Islander        6
  Other                         2
Hispanic
  Yes                           6
  No                           74

Table 9
Test-Retest Reliability Coefficients for UTAGS

                      Time 1      Time 2
UTAGS Score           M (SD)      M (SD)     r_c (r_u)
Subscale
  Cognition          102 (15)    101 (15)    .91 (.91)
  Creativity         102 (17)    102 (17)    .84 (.89)
  Leadership         100 (17)    100 (15)    .85 (.87)
  Literacy           101 (15)    101 (15)    .93 (.93)
  Math               100 (12)    100 (13)    .96 (.93)
  Science             99 (14)    100 (15)    .93 (.92)
Composite
  General Aptitude   100 (14)    101 (14)    .96 (.95)

Note. N = 80. Coefficients corrected for range effects. M = mean; SD = standard deviation; r_c = corrected correlation of test to retest; r_u = uncorrected correlation of test to retest.
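The uncorrected coefficients (r_u) in Table 9 are Pearson product-moment correlations between the Time 1 and Time 2 standard scores. A minimal sketch with hypothetical score pairs (the actual study used N = 80):

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between paired scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical Time 1 / Time 2 standard scores for five students
# (illustrative only, not drawn from the UTAGS test-retest sample).
time1 = [95, 102, 110, 88, 120]
time2 = [97, 100, 112, 90, 118]
print(f"test-retest r = {pearson_r(time1, time2):.2f}")  # → test-retest r = 0.99
```

Because the two administrations rank these students almost identically, the coefficient approaches 1.0; larger rank shuffling between Time 1 and Time 2 would lower it toward the .84–.96 range reported in Table 9.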

CONCLUSIONS

The overall reliability of the UTAGS is summarized in Table 10. The contents of this table show the test's status relative to three types of reliability coefficients: the coefficient alphas listed in the table are the median alphas reported at the top of Table 6, the test-retest coefficients are the corrected coefficients from Table 9, and the scorer reliability coefficients were discussed in the previous section. The table also relates the three types of reliability coefficients to the different sources of possible test error described by Anastasi and Urbina (1997). As the figures listed in the table show, the UTAGS evidences a high degree of reliability, and this high level of reliability is consistent across all three types of reliability studied. The magnitude of the reported coefficients, whether at the subscale or composite level, satisfies the most demanding standards, including those of Bracken (1987), Nunnally and Bernstein (1994), Salvia et al. (2010), and Reynolds et al. (2009). These findings strongly suggest that the UTAGS possesses relatively little measurement error and that test users can have considerable confidence in the test's resulting subscale and total composite scores.

Table 10
Summary of UTAGS Reliability Relative to Three Types of Reliability (Decimals Omitted)

UTAGS Score            Coefficient Alpha   Test-Retest   Scorer
Subscales
  Cognition                   98               91          99
  Creativity                  98               84          99
  Leadership                  98               85          99
  Literacy                    98               93          99
  Math                        99               96          99
  Science                     98               93          99
Composite
  General Aptitude            99               96          99

Source of test error: coefficient alpha — content sampling and content heterogeneity; test-retest — time sampling; scorer — interscorer differences.

Note. Sources of test error are from Psychological Testing (7th ed.), by A. Anastasi and S. Urbina, 1997, Upper Saddle River, NJ: Prentice Hall.
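As a worked check on Table 6, the SEM formula given earlier, SEM = SD × √(1 − r), reproduces the tabled values if one assumes (neither value is stated explicitly in this chapter) that UTAGS standard scores have SD = 15 and that the 68%, 90%, and 95% bands apply multipliers of 1.00, 1.28, and 1.64 to the rounded SEM.

```python
import math

# Assumptions inferred from the tabled values, not stated in the manual text:
# standard-score SD of 15 and z-multipliers of 1.00 / 1.28 / 1.64.
SD = 15.0
Z = {"68% CI": 1.00, "90% CI": 1.28, "95% CI": 1.64}

def sem(alpha: float, sd: float = SD) -> float:
    """Standard error of measurement: SEM = SD * sqrt(1 - r), rounded to 2 places."""
    return round(sd * math.sqrt(1 - alpha), 2)

for score, alpha in [("Cognition (alpha = .98)", 0.98),
                     ("General Aptitude (alpha = .99)", 0.99)]:
    bands = {label: round(sem(alpha) * z, 2) for label, z in Z.items()}
    print(score, bands)
# Under these assumptions the output matches the Table 6 entries
# (2.12 / 2.71 / 3.48 and 1.50 / 1.92 / 2.46; Python prints 1.50 as 1.5).
```

The same computation applied to the alpha of any subscale in Table 6 yields its tabled confidence bands, which is a quick way for a reader to verify the internal consistency of the reported values.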