English Language Development Assessment (ELDA) 2004 Field Test Administration


English Language Development Assessment (ELDA)

TECHNICAL REPORT

2004 Field Test Administration

Submitted to the Council of Chief State School Officers on behalf of the LEP-SCASS by the American Institutes for Research and Measurement Incorporated

October 31, 2005

The contents of this document were developed under grant S368A030006, CFDA A, from the U.S. Department of Education. However, those contents do not necessarily represent the policy of the U.S. Department of Education, and you should not assume endorsement by the Federal Government.


TABLE OF CONTENTS

EXECUTIVE SUMMARY
INTRODUCTION
1. DEVELOPMENT OF STANDARDS AND SPECIFICATIONS
   Development of Items
   Performance Level Descriptors
   Forms of PLDs
   Numbers of Levels and Range
   Focus of PLDs
   PLD Development Procedures
   Development of Field Test Forms
   Forms Design
2. FIELD TEST SAMPLING PLAN
3. LOGISTICAL ISSUES IN FIELD TEST ADMINISTRATION
4. TEST ADMINISTRATION PROCEDURES
   Training Information
   Time to Take Each Field Test
   Accommodations Offered and Used
5. ITEM ANALYSIS
   Scoring Procedures
   Classical Item Analysis
   Differential Item Functioning Analysis
   Review of Items Not Meeting Specified Standards
   Strategy for Modifying Problematic Items
   Field Test Form Analyses
6. RASCH/IRT ANALYSIS
   Field Test Form Equating
   Vertical Linking
7. VALIDITY STUDIES
8. LONG-TERM OPERATION OF THE ELDA PROGRAM
REFERENCES

TABLE OF CONTENTS (CONTINUED)

APPENDIX A: TEST BLUEPRINTS AND ITEM SPECIFICATIONS
APPENDIX B: PERFORMANCE LEVEL DESCRIPTORS
APPENDIX C: SAMPLING PROCEDURE FOR THE MARCH 2004 ELDA FIELD TEST ADMINISTRATION
APPENDIX D: ITEM STATISTICS AND DIF FOR ALL FIELD TESTED ITEMS
APPENDIX E: FREQUENCY DISTRIBUTION OF DIF CATEGORIES
APPENDIX F: FIELD TEST FORM ANALYSIS BY LANGUAGE GROUPS
APPENDIX G: ITEM DIFFICULTIES AND FIT STATISTICS FROM WINSTEPS (BEFORE VERTICAL LINKING)
APPENDIX H: ITEM POOLS WITH CALIBRATED STEP DIFFICULTIES (BEFORE VERTICAL LINKING)
APPENDIX I: VERTICAL LINKING

Written by Judit Antal, Mathina Calliope, Wen-Hung Chen, Phoebe Winter, and Steve Ferrara

LIST OF TABLES

Table 1. Levels of Performance for ELDA
Table 2. Indicators Defining Each Language Domain
Table 3. Total Number of Items in ELDA Spring 2004 Field Test
Table 4. Distribution of Item Types
Table 5. Distribution of Item Types
Table 6. General Test Administration Time Guidelines
Table 7. Summary of DIF Classification Rules
Table 8. Summary of DIF Classification Rules for Polytomous Items
Table 9. Number of Items Flagged on the Basis of Classical Item Analysis
Table 10. Mean Coefficient Alpha Reliabilities
Table 11. Number of Items Flagged for Misfitting Values
Table 12. Number of Linking Items in Vertical Linking Analysis


EXECUTIVE SUMMARY

This report describes the March 2004 field test of the English Language Development Assessment (ELDA). The purpose of this field test was to ensure the development of an operational form by using a multi-stage review grounded in commonly accepted content and psychometric standards. This report details those standards and other procedures used to select items from the item pool for the operational form. The American Institutes for Research (AIR) collected data on items from field test forms across four skill domains and three grade clusters. In addition, the field test sample comprised four language groupings. The field test was designed to result in the construction of the initial operational assessment form for implementation during the following academic year. The 2004 field test assessments had the following components:

Skill Domains: Listening, Reading, Speaking, Writing
Grade Clusters: Grades 3-5, Grades 6-8, Grades 9-12
Language Groups: LEP Spanish, LEP Other, LEP Exited, Native English Speakers

Field test forms were constructed according to test specifications developed by the Steering Committee in collaboration with AIR and members of the Limited-English-Proficient State Collaborative on Assessment and Student Standards (LEP-SCASS). Items were developed by qualified item writers to match specifications, and all items passed a rigorous review process before being included on field test forms. A large sample of students in grades 3-12 participated in the field test.

Classical item analysis (IA) and differential item functioning (DIF) analyses were conducted to detect any potential test administration or scoring problems. Items flagged as too easy were found mostly for Native English Speakers, whereas the most difficult items were found mostly for LEP Spanish students. Two items were identified as too difficult and 352 as too easy for Native English Speakers. For LEP Spanish students, 5 items appeared too easy and 19 were classified as too difficult. The largest number of items with severe DIF was identified for the LEP vs. non-LEP contrast in grade cluster 9-12 Reading. Test difficulty (p-value, or proportion correct value), biserial correlation, and omit rate were calculated for each test form. The average omit rate was moderately low across skill

domains, grade clusters, test forms, and language groups. However, LEP Spanish and LEP Other students had a slight tendency to omit more items than non-LEP students, which suggests that they experienced more difficulty answering test questions. The average biserial and polyserial correlations were moderately high. Reliabilities of the test forms were consistently high across all skill domains, grade clusters, and language groups.

Items then passed through a two-stage review process consisting of reviews by a team of AIR psychometricians and by a Joint Committee formed by members of AIR, the Council of Chief State School Officers (CCSSO), Measurement Incorporated (MI), and the LEP-SCASS state membership. Recommendations were then made to include items in the operational pool, revise and resubmit items for administration in future field tests, or reject items from consideration. Only those items that passed all stages of review were included in the master item pool for operational form construction.

AIR used Masters' (1982) partial credit model to estimate ELDA item parameters. To implement the randomly equivalent groups design, IRT calibrations were conducted separately for each field test form in each grade cluster, setting the mean population ability to zero. Goodness-of-fit indices were also used to further analyze the appropriateness of items. The highest proportion of misfitting items (using Infit and Outfit statistics) was found in Reading, grade cluster 6-8 (26%).

Through IRT calibration of common items embedded in the test forms of adjacent grade clusters, it was possible to link the measurement scales of the three grade clusters onto one scale. To ensure the quality of the linking item pool, AIR used a stepwise deletion procedure when computing the linking constant. Overall, 75% of the anchor items remained in the final linking item pool.

Operational form construction was conducted jointly by AIR psychometricians, AIR content experts, and LEP-SCASS state member representatives. Form construction used the item locations from the Rasch/IRT analyses to pre-equate operational forms. This process resulted in the creation of an operational form for use in the following school year.
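As a rough illustration of the stepwise deletion idea (a minimal sketch, not the report's exact algorithm; the item names, difficulties, and retention threshold below are hypothetical), the linking constant can be taken as the mean difficulty shift on the common items, with the most discrepant anchor dropped and the constant recomputed until all remaining anchors shift consistently:

    # Sketch: mean/mean vertical linking with stepwise anchor deletion.
    # b_lower/b_upper map common-item IDs to Rasch difficulties calibrated
    # separately in adjacent grade clusters. All values are hypothetical.
    def linking_constant(b_lower, b_upper, max_resid=0.5):
        items = list(b_lower)
        while True:
            const = sum(b_upper[i] - b_lower[i] for i in items) / len(items)
            # Residual shift of each anchor relative to the current constant.
            resid = {i: abs((b_upper[i] - b_lower[i]) - const) for i in items}
            worst = max(resid, key=resid.get)
            if resid[worst] <= max_resid or len(items) <= 2:
                return const, items
            items.remove(worst)  # drop the most discrepant anchor, recompute

    b35 = {"R01": -0.8, "R02": 0.1, "R03": 0.6}   # 3-5 calibration
    b68 = {"R01": -1.6, "R02": -0.7, "R03": 0.9}  # 6-8 calibration, same items
    const, kept = linking_constant(b35, b68)
    print(f"linking constant: {const:.2f}, anchors kept: {kept}")

Dropping only the single worst anchor before recomputing reflects the idea that a few items functioning differently across clusters should not be allowed to distort the link.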

INTRODUCTION

This report describes the March 2004 field test of the English Language Development Assessment (ELDA). The purpose of this field test was to ensure the development of an operational form by using a multi-stage review grounded in commonly accepted content and psychometric standards. This report details those standards and other procedures used to select items from the item pool for the operational form. The purpose of this report is to provide a coherent overview of the March 2004 field test and the resultant pool of items used to construct the operational test form.

ELDA is a battery of tests designed to allow schools to measure annual progress in the acquisition of English language proficiency skills among non-native English speaking students in grades 3-12. The battery consists of separate tests for listening, speaking, reading, and writing, at each of three grade clusters: 3-5, 6-8, and 9-12. The tests are aligned with the ESL standards of project member states and are developed to provide content coverage across three academic topic areas (English/Language Arts, ELA; Math, Science, and Technology, MST; and Social Studies, SS) and one non-academic topic area related to the school environment (School-Environmental, S-E, which includes topics such as extra-curricular activities, student health, homework, classroom management, and lunchtime, among many others). They are tests of language skills with content drawn from age-appropriate school curricular and non-curricular sources. They are not tests of academic content; in other words, no external or prior content-related knowledge is required to respond to test questions. Nor is performance on the production-skills tests scored for the content of a response beyond what may be supplied in the test input.

While the main function of the ELDA tests is to measure annual progress in English language acquisition, they also permit the identification of students who have reached full English proficiency (FEP), or LEP-exit level; that is, a level considered appropriate for successful functioning within the school system at the appropriate grade level. It should be stressed that FEP is not intended to be synonymous with native English speaker. The tests are not designed to provide placement information relative to English language courses or programs offered at a school. Nor are they designed to provide diagnostic feedback to students and their language teachers. Such functions could, at a future date, be integrated into the current battery.

ELDA is designed to measure progress in the acquisition of English language proficiency across three grade clusters within grades 3-12. The three clusters (3-5, 6-8, and 9-12) reflect common administrative clustering in many school systems, common clustering in other similar tests, and cognitively and developmentally appropriate grouping. An important factor in decisions regarding grade clustering is the English language development characteristic of the target population, which is diverse across grades 3-12, ranging from complete beginners to fully English proficient students. Broad grade clustering, as determined for ELDA, allows for a more appropriate distribution of students across performance levels within each cluster than would have been possible with finer cluster distinctions. Broad grade clustering also reduces the challenges implied in vertical alignment procedures across clusters within each domain.

As required under NCLB, ELDA contains separate tests for each of the four skill domains of listening, speaking, reading, and writing, for which separate scores need to be reported. The March 2004 field tests included assessments covering four skill domains across three grade clusters. In addition, the field test sample comprised four language groupings. The American Institutes for Research (AIR) gathered data by using two field test forms per grade cluster, resulting in 24 stand-alone field test forms. The use of multiple forms was required to produce an item candidate pool large enough to create one operational form that used only those items passing content and psychometric reviews. The March 2004 field test assessments had the following components:

Skill Domains: Listening, Reading, Speaking, Writing
Grade Clusters: Grades 3-5, Grades 6-8, Grades 9-12
Language Groups: LEP Spanish, LEP Other, LEP Exited, Native English Speakers

The remainder of this report is organized into eight major sections. The first section is a brief description of the procedures used to construct field test forms for each skill domain and grade cluster combination to adequately represent relevant ESL standards and minimize the testing burden on participating schools. The second section describes the field test sampling plan. The third and fourth sections detail logistical issues and test administration procedures. Section 5 explains the procedures used to analyze the field test data (classical item analysis and differential item functioning analyses), and Section 6 describes the item response theory (IRT)

techniques used to calibrate item difficulty by test form within grade clusters and across grade clusters for each skill domain. The last two sections cover the validity studies and recommendations regarding ongoing development and maintenance of the psychometric integrity of ELDA.

1. DEVELOPMENT OF STANDARDS AND SPECIFICATIONS

The design, development, and implementation of ELDA have been headed by the Council of Chief State School Officers (CCSSO), based in Washington, D.C., specifically the Limited-English-Proficient State Collaborative on Assessment and Student Standards (LEP-SCASS). Nevada has headed the consortium of states. AIR has developed the test items and forms, and the Center for the Study of Assessment, Validity, and Evaluation (C-SAVE), based at the University of Maryland, has provided validity and reliability research to the project.

No Child Left Behind, the 2001 reauthorization of the Elementary and Secondary Education Act, was the impetus for the project, and six requirements highlighted in the act have shaped the architecture and conceptualization of ELDA. States must measure proficiency and show progress; assess all LEP students; independently measure the four skill domains of reading, writing, speaking, and listening; report a separate measure for comprehension; assess proficiency in academic language and in the language of social interaction; and align the assessments with their state English Language Development (ELD) Standards.

ELDA was aligned to state ESL standards through an analysis of the ESL standards of consortium states available to the project at the outset. From an analysis of state ESL standards for each of the four skill domains, AIR constructed, and the LEP-SCASS approved, a set of core ESL standards, which formed the basis for item design.

The content that forms the basis for test items in ELDA is distributed across four topic areas. Approximately 25 percent of the items use language from each of the three curriculum domains of mathematics, science, and technology; English language arts; and social studies. The remaining 25 percent of the items use the social language of interaction between students, teachers, other school personnel, and parents related to school issues.

AIR proposed, and the LEP-SCASS approved, test item specifications for each of the four skill domain areas. Listening consists of only multiple-choice items. Students listen to five types of texts read by a narrator and actors (short phrases, short dialogues, extended dialogues, short presentations, and extended presentations; the 3-5 grade cluster excludes extended presentations because of developmental inappropriateness) and answer comprehension questions. Reading is also entirely multiple choice. Students read three types of text (short, early reading comprehension passages or cloze items; instructions; and long passages) and answer comprehension questions. Writing comprises both multiple-choice and short- and extended-constructed-response items. Three broad standards (editing, revising, and planning and organizing) are assessed through multiple-choice items attached to short student-written passages. Speaking consists only of constructed-response items. Items come in sets of four prompts, each eliciting a different speaking function.

Development of Items

To develop items that measure these academic standards as specified by the content specifications, AIR brought together a highly competent pool of item writers, using a mix of external item writers, NAEP foreign language item writers, and other internal content experts. The LEP-SCASS states recommended to AIR teachers who had experience with assessment development, and AIR contacted those teachers and selected them based on their availability. Bill Eilfort and Natalie Chen, assessment development consultants, guided the item writers during a weekend workshop in Denver in February. The consultants, along with AIR staff, working in groups by domain and grade level, trained the item writers by explaining general item writing principles and were available to help the participants develop their items. AIR also provided books, texts, and other reference materials (such as the Encarta online encyclopedia) for the item writers, and they developed items individually and in small groups.

After items were drafted and reviewed by the writers, they were entered into the review protocol as part of AIR's Item Tracking System (ITS) database. The following review levels were then conducted:

Preliminary: Items reviewed by junior staff for formatting and basic item construction principles
LABS: Items reviewed by a trained and certified LABS (language accessibility, bias, and sensitivity) reviewer
Editor: Items reviewed for grammar, writing conventions, and clarity

Senior: Items reviewed by a senior content expert in ESL or English language arts, evaluating the items for their match to the standards and for their measurement integrity

Items that passed all these reviews were brought to LEP-SCASS meetings for review, comment, revision, and approval. At SCASS content review meetings, members split into grade-cluster groups, were instructed on the specifications for the items, the standards, and the benchmarks, and individually reviewed the items before meeting as a group to accept, revise, reject, or recommend revision and resubmission of the items. Those items that survived the final review entered the field-test item pool.

Performance Level Descriptors

AIR has developed performance level descriptors (PLDs) for each of the four language skills tested in the ELDA battery (listening, speaking, reading, and writing; see Appendix B). In addition to the requirement that independently obtained scores be reported for each of these four domains, federal regulations also require that a fifth score be reported for the composite skill of comprehension, derived from a combination of the listening and reading scores. As such, we have included comprehension as a fifth set of PLDs. The PLDs have undergone a series of review and revision procedures at AIR (outlined below) with both ELDA project staff and non-project staff.

Forms of PLDs

The PLDs exist in two forms, reflecting the two different functions that PLDs serve. One form is a matrix display of the descriptions of performance levels intended for standard-setting purposes. In this format we expect the cell-by-cell display of information to facilitate the task of capturing the distinctions between levels for each of the performance indicators that make up the language domain (the term indicators in this document refers to the set of component skills that define each of the language domains). The second form is a narrative of the information contained in the matrix intended for reporting purposes. In this format we expect the description to capture the character of a level across all the indicators, thus making it easier for stakeholders such as schools and parents to interpret test performance and progress in the

acquisition of English. This document first presents the matrix form of the PLDs, and then the narratives.

The PLDs are common across the grade clusters for which the tests have been designed. The PLD for listening, for example, provides common performance level descriptions for the 3-5, 6-8, and 9-12 clusters. Underlying the notion of a common performance scale to describe English language proficiency development across grades 3-12 is the assumption that the same domain performance indicators, as well as the values given to each indicator at each level, can be used to describe language development in third grade as in high school. We believe that this assumption is valid: indicators such as text type, text-level structure, or fluency, and their values across performance levels, are generalizable across age or grade. What is variable across age or grade, with respect to test design and performance measurement, are the specific age-appropriate test input materials (contexts for language use such as stimuli, the topics embedded in stimuli, specific features of grammar and vocabulary that are bound by cognitive development, and test graphics) and the cognitive skills required by the language tasks of the test.

Numbers of Levels and Range

It was determined at the Steering Committee meeting held at project start-up in Berkeley, California, in December 2002, that the PLDs should contain five levels of discrimination (see Table 1 below). Five levels were considered appropriate to (a) capture the construct of development in academic English language proficiency from Pre-functional to Full English Proficiency; (b) permit an acceptable resolution to the tension between cost effectiveness in test development and test administration efficiency versus psychometric needs; (c) permit students to show growth in English language development from one year to the next; and (d) permit students to reach the ultimate target of Full English Proficiency within a realistic period of time.

The levels range from Full English Proficiency, a level at which an LEP student is deemed able to function effectively and consistently through the medium of academic English in the school system (and thus ceases to be defined as LEP), to Pre-functional, a level at which an LEP student is consistently unable to communicate with any success in the English of the school environment, although he or she may have some limited knowledge of English. It should be pointed out that the proficiency required for entry into the Full English Proficiency level is not

synonymous with native-speaker proficiency in English; FEP students may function effectively and successfully in the school system while still exhibiting a non-native speaker accent, while making production errors (which typically would not impede communication), or while comprehending less than the full range of subtle meanings intended by a writer or speaker (again, with little negative effect on communication). By contrast, many aspects of an Advanced level proficiency, and some aspects of an Intermediate and even of a Beginners level proficiency, allow for the demonstration of an ability to function effectively in the school system, with less and less consistency and sophistication as one moves down the scale.

Table 1. Levels of Performance for ELDA

Level   Label
5       Full English Proficiency
4       Advanced
3       Intermediate
2       Beginners
1       Pre-functional

Note: The labels used to define each level are provisional and pending approval by the LEP-SCASS members.

Focus of PLDs

The PLDs developed by AIR, in agreement with guidelines established by the LEP-SCASS members at the project start-up Steering Committee meeting, are intended to describe threshold points rather than the full range implied by a level; that is, each description characterizes what is minimally required for entry into a level. This is true of other second or foreign language scales of performance, including those of the American Council on the Teaching of Foreign Languages (the ACTFL Proficiency Guidelines), the Interagency Language Roundtable Proficiency Levels (ILR, or Government Foreign Service scale), and the Council of Europe Proficiency Levels. The threshold approach is motivated by the constructs that are defined by the levels; that is, language skills that are cyclical, multidimensional, and expanding in their patterns of learning, rather than skills that are linear and unidimensional with learning that proceeds at a constant rate of complexity. For example, at level 1 a student may have no understanding of how to express present time in English; at level 2 a student may be able to express present time through the use of the present tense of some common verbs with simple adverbial present tense markers; at level 3 and beyond a student should have more extensive

ways of expressing present time, with the use of a greater range of verbs, more sophisticated time markers, and an ability to contrast present with other time references. The model is often described in the literature as an inverted pyramid in which, as one progresses up the scale, progressively more language skill is required to attain the next level.

The PLD for the bottom level is an exception; it does not conform to the threshold requirement but rather is a description of a range, from zero knowledge and ability to just below what is minimally required for entry into level 2. The range implied by level 1, a pre-functional or pre-communicative level (to continue the pyramid metaphor, the apex of the pyramid), is relatively uncontentious to define.

PLD Development Procedures

An initial draft version of the PLDs for each of the four language skills was created at the project start-up Steering Committee meeting. Documents from the California State and New York State English Language Proficiency Levels, which represented substantive consideration of the issue of defining proficiency levels in the field of standards-aligned assessment for ESL, were consulted in this initial process. The draft version of the PLDs provided an important initial understanding of, and agreement on, the type of characterization required at each of the five levels, reflecting a common understanding of the theoretical foundation for the descriptions. Particularly important was the determination of a working definition of level 5, Full English Proficient.

The initial draft version of the PLDs, however, lacked vertical and horizontal alignment. The PLDs were substantially reviewed and revised during the test development process to achieve alignment, both within domain and across the four domains. This review and revision process involved the following steps:

1. Analyzing the original draft versions of the PLDs within a matrix to determine the performance indicators used to define each domain;
2. Assessing the degree to which the indicators represented a complete and theoretically sound definition of the domain and aligned with test design specifications and scoring rubrics for constructed-response items, and then making appropriate revisions to the indicators;
3. Assessing the degree to which the indicators were vertically aligned across the five levels and then making appropriate revisions;

4. Assessing the degree to which the indicators were horizontally aligned across domains (particularly listening with reading, and speaking with writing) and then making appropriate revisions;
5. Analyzing both listening and reading PLDs to determine what may be considered common for the creation of comprehension PLDs;
6. Reviewing the content of all PLDs with internal AIR staff who are content experts but who are not related to the ELDA project; and
7. Submitting all PLDs to editorial review.

Table 2 below summarizes the performance indicators identified for each of the five domains for which PLDs have been constructed. The receptive skills are listening, reading, and comprehension; the productive skills are speaking and writing.

Table 2. Indicators Defining Each Language Domain

1. Text types: content area/non-content area
2. Discourse types: content area/non-content area
3. Speech types: connect, tell, expand, reason; content area/non-content area
4. Forming a general understanding: main idea, theme, problem, conflict, plot, character, event, mood, message, purpose
5. Developing an understanding: details
6. Linking information: communicator point of view; inference, conclusion, evaluation
7. Vocabulary and structure: academic/school-social; formal/informal
8. Use of vocabulary: academic/school-social; formal/informal
9. Text-level structure: organization; logic of argument; cohesive devices

Table 2. Indicators Defining Each Language Domain (continued)

10. Sentence-level structure: tense; modality; word order; inflection
11. Mechanics: punctuation; spelling; capitalization
12. Fluency: creativity, spontaneity, flexibility; pronunciation

Development of Field Test Forms

This section briefly describes the construction of field test forms. Field test form construction is critical to ensuring a large item pool from which operational forms can be constructed. Errors in the field test form-construction phase can result in a depleted item pool or a mis-estimation of item parameters that perpetuates throughout the form-construction process.

The specifications for the operational forms were used to develop specifications for the field test forms. AIR, with the LEP-SCASS technical advisory committee (TAC), determined that two field test forms (A and B) for each grade cluster for each skill domain would most likely yield the number of items necessary for building one operational form. For Reading and Listening, each 3-5 and 6-8 field test form contained 60 items, in contrast to the 50 items specified for the operational form. The 9-12 field test form contained 75 items, in contrast to 60 in the operational form. For Writing, Form A and Form B differed to maximize the number of constructed-response items field tested. For Speaking, the field test forms mirrored exactly the blueprints for the operational form. The bookmaps for the field-test forms for Reading, Writing, and Listening, which provide descriptions of standards, item types, and target difficulty levels, can be found in Appendix A, along with the forms map for Speaking, which lists task names and shared tasks.

The next step was to construct field test forms from the items in the field test item pool. Content experts from AIR selected items from this pool to meet the requirements described in the test specifications. Guidelines for form assembly were the following: balance across keys and across topic areas, minimization of item exposure, balanced coverage of the ESL standards per the specifications, and attention to item/set influence on one another. Forms were submitted to the LEP-SCASS for review and approval, and AIR staff made adjustments on the basis of feedback.

Forms Design

AIR used a randomly equivalent groups design for the initial field test of ELDA items. This design allowed us to maximize the number of unique items administered in the field test while keeping the number of field test forms to a minimum. To field test new items in subsequent years, AIR will adopt a common item equating design in which field test items are embedded in operational test forms. Embedded field test designs have the advantage of requiring only one administration per school year, easing the burden on participating schools and students, and they allow us to obtain field test item parameter estimates under operational test administration conditions.

In 2004, only multiple-choice items were included in the Reading and Listening test forms. Writing test forms included multiple-choice, extended constructed-response, and short constructed-response item formats. Speaking test forms comprised graphic prompts to which students provided oral responses. Student responses were recorded on audio tape for subsequent scoring. Table 3 specifies the total number of items appearing in the spring 2004 field test.

For vertical linking, some items were shared across forms. Items appeared on both the 3-5 and 6-8 instruments, and on both the 6-8 and 9-12 instruments. To determine which items to share, content experts developed selection criteria: the shared items should cover the standards, and there should be even text-type representation. Grade and curriculum appropriateness was also considered. For example, a passage shared across the 6-8 and 9-12 forms would not deal with high school graduation. Finally, difficulty level was considered: an attempt was made to choose the easiest 6-8 items to share with the 3-5 cluster or the most difficult 3-5 items to share with the 6-8 cluster, as sketched below.
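A minimal sketch of that difficulty-based screening, using hypothetical item identifiers and classical p-values (the operational selection also weighed standards coverage, text type, and grade appropriateness):

    # Sketch: rank items by classical difficulty (p-value; higher = easier)
    # to nominate vertical-linking candidates. All data are hypothetical.
    pool_68 = [("L11", 0.84), ("L12", 0.55), ("L13", 0.91), ("L14", 0.47)]
    pool_35 = [("L01", 0.38), ("L02", 0.72), ("L03", 0.29), ("L04", 0.66)]

    def easiest(pool, n):
        # Candidates from the upper cluster to share with the lower cluster.
        return sorted(pool, key=lambda item: item[1], reverse=True)[:n]

    def hardest(pool, n):
        # Candidates from the lower cluster to share with the upper cluster.
        return sorted(pool, key=lambda item: item[1])[:n]

    share_down = easiest(pool_68, 2)  # easiest 6-8 items -> 3-5 forms
    share_up = hardest(pool_35, 2)    # hardest 3-5 items -> 6-8 forms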

Table 3. Total Number of Items in ELDA Spring 2004 Field Test: item counts by grade cluster (3-5, 6-8, 9-12) and form (A, B) for Listening, Reading, Speaking, and Writing. As noted above, each 3-5 and 6-8 Reading and Listening form contained 60 items, and each 9-12 Reading and Listening form contained 75.

Table 4 provides information on item types and the maximum score points available in each form.

Table 4. Distribution of Item Types (Listening): counts of multiple-choice items and 2-, 3-, and 4-point constructed-response items, with total score points, by grade cluster (3-5, 6-8, 9-12) and form (A, B).

Table 5. Distribution of Item Types (Reading, Speaking, and Writing): counts of multiple-choice and constructed-response items, with total score points, by grade cluster and form.

For Reading and Writing, we spiraled the two field test forms within classrooms. For Listening and Speaking, we spiraled the two field test forms across classrooms within schools for those schools administering field tests to more than one classroom.

2. FIELD TEST SAMPLING PLAN

AIR developed a field test sampling plan for spring 2004 that was implemented by Measurement Incorporated (MI). The main purpose of spring 2004 field testing was to collect data for item screening and parameter calibration. The final goal was to construct one core ELDA form to be administered in the following academic year. A target of 1,000 students per form was set to obtain reliable estimates for item screening and parameter estimation. The number of students requested from each state was determined by the number of states participating in the field test. Thirteen states participated in the spring 2004 field test (Alabama, Georgia, Iowa, Indiana, Kentucky, Louisiana, Nebraska, New Jersey, Ohio, Oklahoma, South Carolina, Virginia, and West Virginia). A sample of 2,000 students per grade cluster was drawn equally from the participating member states.

The first sampling plan, approved by the Technical Advisory Committee, was distributed to the member states in December. According to this design, schools within each state were clustered into four groups on the basis of the average English proficiency of their LEP students. Following this scheme, schools were coded as Low, Medium Low, Medium High, and High with respect to the overall English language proficiency of their LEP students. Schools were then sampled proportionally to their group size, as measured by the total number of schools at each of the four levels of overall language proficiency. Within schools, students were selected to participate in the field test according to their native language background (LEP Spanish, LEP Other, LEP Exited, and native English speakers).
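A minimal sketch of this proficiency-based proportional draw (school identifiers, band memberships, and the sampling target below are hypothetical; within-school selection by language group is not shown):

    import random

    # Sketch: schools grouped into four bands by the average English
    # proficiency of their LEP students, then sampled proportionally to
    # band size. All data are hypothetical.
    bands = {
        "Low":         ["S01", "S02", "S03", "S04", "S05", "S06"],
        "Medium Low":  ["S07", "S08", "S09", "S10"],
        "Medium High": ["S11", "S12", "S13", "S14", "S15", "S16", "S17", "S18"],
        "High":        ["S19", "S20"],
    }
    total = sum(len(group) for group in bands.values())
    target = 10  # hypothetical number of schools to sample in the state

    random.seed(1)
    sample = {}
    for band, group in bands.items():
        n = round(target * len(group) / total)  # proportional allocation
        sample[band] = random.sample(group, min(n, len(group)))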

In February 2004, however, some states indicated that they did not collect student performance data on English proficiency and were therefore unable to follow this sampling procedure. For this reason, a revised plan (see Appendix C) was developed and distributed to these states. The new plan required states to select students according to the number of LEP students enrolled. Following this plan, schools in each state were listed within each grade cluster in order of the number of LEP students enrolled. The list of schools was then divided into equal thirds, called Large, Medium, and Small, corresponding to the number of LEP students enrolled. Some questions arose when the states actually implemented this procedure, and a supplementary document was later distributed to the member states to address those questions (see Appendix C).

3. LOGISTICAL ISSUES IN FIELD TEST ADMINISTRATION

Materials for each student were put into a plastic bag. The bag included a Reading test booklet, Writing test booklet, Speaking test booklet, Listening test booklet, blank Speaking response tape, and Student Background Questionnaire. Materials for each teacher were put into a separate bag. These included the Speaking and Listening prompt cassettes and the Administration Manual. All materials in the teacher and student bags (with the exception of the Administration Manual, which was not secure) were security barcoded.

4. TEST ADMINISTRATION PROCEDURES

Training Information

The primary training vehicle was the Administration Manual. In addition, a toll-free help line staffed with trained ELDA personnel was available throughout the project.

Time to Take Each Field Test

The field test was officially untimed. General guidelines for times were provided in the Administration Manual, as follows:

Table 6. General Test Administration Time Guidelines

Grade Cluster   Listening         Speaking   Reading           Writing
3-5             1 hour, 20 mins   25 mins    1 hour            1 hour
6-8             1 hour, 20 mins   25 mins    1 hour            1 hour
9-12            1 hour, 40 mins   25 mins    1 hour, 15 mins   1 hour

Accommodations Offered and Used

The Administration Manual provided test administrators with guidelines for offering and using test accommodations. Specifically, any accommodation offered should be related to the student's specific disability. No accommodations were allowed that might change the content of the assessment. For example, defining words used in the writing or reading passages, in any other stimulus materials, or in the assessment questions was not considered appropriate. Accommodations in the administration procedures for ELDA were allowable provided that they were specified in a student's IEP or 504 plan and provided for the ELDA. Because a student's assessment results should reflect her or his true ability and should not be influenced by inappropriate accommodations, the Administration Manual emphasized that any accommodations should be consistent with practices routinely used in the student's instruction and assessment. Test administrators were instructed that any accommodations provided for an individual must be specified before the student takes the assessment and must be documented in the student's IEP. They were directed to contact their District Coordinator for additional state guidelines on accommodations for the ELDA.

If a student with disabilities takes the ELDA, the administration of the assessment should be under standardized assessment conditions. Only those accommodations listed below or specifically identified in the student's IEP or 504 plan may be provided. Any accommodation provided to a student must be noted on the third page of his or her ELDA Student Background Questionnaire. The following accommodations may be provided to students with disabilities on the ELDA (in addition to any accommodations specified in the student's IEP or 504 plan):

Computerized Assessment: Students may use a computer to type their responses instead of writing in their test booklets. Spell check, glossaries, grammar check, dictionaries, and thesauruses are not allowed on the ELDA. Word-processed responses should be stapled into the student's original test booklet.

Dictation of Responses: Students who are unable to write due to a disability are allowed to dictate their responses to a transcriber or into an audio recorder for the Reading and Listening ELDA. The student's answers should be transferred into the student's original test booklet. A scribe may not be used for the Writing ELDA.

Extended/Adjusted Time: The ELDA is an untimed assessment. For students whose attention span or behavior interferes with regular testing sessions, test administration may be altered to allow for a number of shorter testing sessions. Testing may also be stopped and continued at a later time if behavior interferes with the testing session. The time of day the test is administered may also be adjusted to be most beneficial to the student. All testing sessions MUST be completed within the allotted testing window.

Individual/Small Group Administration: Tests may be administered to a small group or an individual requiring more attention than can be provided in a large group administration.

5. ITEM ANALYSIS

AIR received the data from MI for analysis. An initial evaluation of item quality and an examination of potential bias in item performance were conducted with classical item analysis (IA) and analysis of differential item functioning (DIF). Classical item analysis is a relatively straightforward approach to examining the quality of the items in each field test form. DIF analyses in language assessments are designed to determine whether students of similar levels of proficiency have different probabilities of answering items correctly (or receiving higher credit levels in the case of constructed-response items) because of language-group membership. In some cases, differential item functioning may indicate item bias. However, an external committee must review all items classified as having high levels of DIF to determine whether an item is unfair to members of various language subgroup populations. The following sections describe the steps of the item analyses, which include classical item analysis, DIF analysis, and a review of those items not meeting the specified item statistics. The final section discusses analyses of test reliability and test difficulty.

Scoring Procedures

Experienced MI professional readers and scoring leadership performed the Writing and Speaking handscoring. These same readers scored the 2005 census test that immediately preceded the field test scoring. Writing training materials were identical to those that came out of the 2004 Rangefinding meetings held by the CCSSO in Boston and used in the 2004 field test and 2005 census test. According to contract guidelines, 10% of the writing responses were second-read as a reliability check. Readers who scored the Speaking field test responses were trained using the same materials shipped to teachers who scored the 2005 Speaking census responses. These materials were developed by MI and representatives from a number of SCASS states, including Nebraska and Louisiana.
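A second-read check like this is commonly summarized by exact and adjacent agreement rates; a minimal sketch with hypothetical score pairs follows (the report does not specify which agreement statistics MI computed):

    # Sketch: reader agreement on double-scored Writing responses.
    # Each pair is (first read, second read); all scores are hypothetical.
    pairs = [(3, 3), (2, 3), (4, 4), (1, 1), (3, 2), (4, 3), (2, 2)]

    exact = sum(a == b for a, b in pairs) / len(pairs)
    adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
    print(f"exact agreement: {exact:.2f}, within one point: {adjacent:.2f}")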

Classical Item Analysis

In addition to evaluating the statistical properties of test items, item analysis also provides an opportunity to detect possible data errors before the analysis moves further. For each item, the item analysis yields the proportion of students selecting each response option for multiple-choice items or the proportion of students scoring in each response category for constructed-response items. The omit rate for each item is also calculated and includes both the percentage of students skipping the item and the percentage of students not reaching the item. Biserial correlations (unadjusted) for multiple-choice items and polyserial correlations for constructed-response items are used to examine the correlation between a student's performance on each item and the student's overall performance on the test form. For purposes of calculating item statistics, omitted and not-reached items are treated as incorrect.

For multiple-choice items, the proportion correct value (p-value) is calculated as the number of students who answer an item correctly divided by the total number of students. The biserial correlation of the keyed response is the correlation between the item score and the total test score. Biserial correlations are also calculated for distracters. The biserial correlation for a distracter is the correlation between the item score, treating the target distracter as the correct response, and the total test score, restricting the sample to only those who chose either the target distracter or the keyed response (Attali & Fraenkel, 2000).

For constructed-response items, we calculated the proportion of students falling into each score-point category defined by the item's scoring rubric (e.g., 0, 1, 2 for constructed-response items with three score categories; 0, 1, 2, 3, 4 for constructed-response items with five score categories). We computed item difficulty as the mean score on the item across all students taking the form. For both multiple-choice and constructed-response items, omit rates were also calculated. High rates of response omission may indicate confusion among test takers about how to respond to the item, confusion among readers about how to score the item, excessive test speededness, or an item that is too difficult. Appendix D presents item statistics by language group for all field tested items.
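A minimal sketch of these statistics for a single multiple-choice item, with hypothetical responses and total scores; a point-biserial correlation is used here as a simple stand-in for the unadjusted biserial reported in Appendix D:

    # Sketch: p-value, omit rate, and item-total correlation for one MC
    # item. Omits ("") and not-reached responses are scored 0 (incorrect).
    responses = ["B", "B", "C", "", "B", "A", "B", "D", "B", ""]
    totals = [42, 38, 25, 12, 40, 22, 35, 18, 44, 9]  # form total scores
    key = "B"

    scores = [1 if r == key else 0 for r in responses]
    p_value = sum(scores) / len(scores)              # proportion correct
    omit_rate = responses.count("") / len(responses)

    def mean(xs):
        return sum(xs) / len(xs)

    # Point-biserial: Pearson correlation of the 0/1 item score with the
    # total test score.
    m_item, m_total = mean(scores), mean(totals)
    cov = sum((s - m_item) * (t - m_total)
              for s, t in zip(scores, totals)) / len(scores)
    sd_item = mean([(s - m_item) ** 2 for s in scores]) ** 0.5
    sd_total = mean([(t - m_total) ** 2 for t in totals]) ** 0.5
    point_biserial = cov / (sd_item * sd_total)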

Differential Item Functioning Analysis

AIR conducted analyses of differential item functioning (DIF) on all field test items to detect potential item bias across language groups. The purpose of these analyses is to identify items that may favor students in one group over students of similar ability in another group. For example, items that are more difficult for one language group (e.g., Spanish LEP students) may require background knowledge or skills that are less prevalent among these students than among another language group (e.g., Other LEP students) of similar ability, indicating potential item bias. This interpretation is referred to as construct irrelevance. AIR performed three DIF analyses for each item: (1) LEP Spanish versus LEP Other, (2) LEP Exited versus native English speakers, and (3) LEP (LEP Spanish and LEP Other combined) versus non-LEP (LEP Exited and native English speakers combined).

AIR employed the Mantel-Haenszel procedure to conduct the field test DIF analyses whenever the sample size was greater than N = 300 for both groups (Holland & Thayer, 1986, 1992). For detecting DIF in dichotomous items, the Mantel-Haenszel (MH; Holland & Thayer, 1988; Mantel & Haenszel, 1959) and generalized Mantel-Haenszel (GMH; Mantel & Haenszel, 1959; Somes, 1986) procedures are the most popular DIF detection procedures used in educational measurement (Fidalgo, Mellenbergh, & Muniz, 2000; Wang & Su, 2004). Recent research shows that MH procedures are robust and have sufficient power at the 5% significance level under conditions with 10% DIF items. The MH test also maintains good control of its Type I error rate and is more powerful than the likelihood ratio test when the comparison groups' latent trait distributions are identical (Ankenmann, Witt, & Dunbar, 1999). Also, Wang and Su (2004) showed that MH and GMH both yield better control of Type I error than other methods when used with the Rasch model because they use number-correct scores as the matching variable; furthermore, test length has little impact on their performance, and they can be used with smaller sample sizes.

Each student's total score on the test was used as the ability-matching variable. The MH delta (Δ̂MH), the log-odds ratio converted to the delta difficulty scale, where 0 indicates no DIF, is then computed. Items are classified into three categories, ranging from no DIF to mild DIF to severe DIF, according to the DIF classification convention established by the Educational Testing Service (Allen, Kline, & Zelenak, 1996) and summarized in Table 7, in which A refers to no DIF, B to mild DIF, and C to severe DIF.
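Before those rules are applied, the statistic itself must be computed. A compact sketch for one dichotomous item, with hypothetical counts; each stratum is a total-score level, and the standard conversion Δ̂MH = -2.35 ln(α̂MH) places the common odds ratio on the ETS delta scale:

    import math

    # Sketch: Mantel-Haenszel common odds ratio across score strata.
    # Per stratum: (reference correct, reference wrong, focal correct,
    # focal wrong). All counts are hypothetical.
    strata = [(55, 20, 40, 30), (70, 10, 58, 18), (90, 5, 80, 9)]

    num = sum(rc * fw / (rc + rw + fc + fw) for rc, rw, fc, fw in strata)
    den = sum(rw * fc / (rc + rw + fc + fw) for rc, rw, fc, fw in strata)
    alpha_mh = num / den                     # common odds ratio
    delta_mh = -2.35 * math.log(alpha_mh)    # delta scale; 0 indicates no DIF

    # A |delta| of at least 1.5 (and statistically significant) would fall
    # into category C under the rules in Table 7.
    print(f"alpha_MH = {alpha_mh:.2f}, MH delta = {delta_mh:.2f}")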

Table 7. Summary of DIF Classification Rules

Category   Rule
C          |Δ̂MH| is significantly larger than 1.0, and |Δ̂MH| ≥ 1.5.
B          |Δ̂MH| is significantly larger than zero and either (a) |Δ̂MH| < 1.5, or (b) |Δ̂MH| is not significantly different from 1.5.
A          Δ̂MH is not significantly different from zero, or |Δ̂MH| < 1.0.

For polytomous items, we calculated both the Mantel-Haenszel chi-square (MH χ²; Zwick & Thayer, 1996; Zwick, Donoghue, & Grima, 1993) and the Standardized Mean Difference (SMD) index (Dorans & Kulick, 1986). For constructed-response items, the classification rules are defined in Table 8. Appendix D exhibits the results of the three DIF analyses for each item.

Table 8. Summary of DIF Classification Rules for Polytomous Items

Category   Rule
C          The p-value of MH χ² is less than .05 and |SMD|/SD > 0.25.
B          The p-value of MH χ² is less than .05 and |SMD|/SD ≤ 0.25.
A          Otherwise.

Review of Items Not Meeting Specified Standards

The item analyses provided information about the quality of the items. Items were flagged for review for four main reasons:

1. The proportion correct value is out of range [0.2, 0.9].


More information

Oral Proficiency Interview Tester Training. Manual >>>CLICK HERE<<<

Oral Proficiency Interview Tester Training. Manual >>>CLICK HERE<<< Oral Proficiency Interview Tester Training Manual the interests, experiences, and the linguistic abilities of the test takers. The Oral Proficiency Interview-computer (OPIc) was developed as a computerized

More information

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison Empowered by Psychometrics The Fundamentals of Psychometrics Jim Wollack University of Wisconsin Madison Psycho-what? Psychometrics is the field of study concerned with the measurement of mental and psychological

More information

MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and. Lord Equating Methods 1,2

MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and. Lord Equating Methods 1,2 MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and Lord Equating Methods 1,2 Lisa A. Keller, Ronald K. Hambleton, Pauline Parker, Jenna Copella University of Massachusetts

More information

Performance Indicator INFORMATION FLUENCY - CONNECT

Performance Indicator INFORMATION FLUENCY - CONNECT Information Fluency Curriculum 1 Assessment Matrix June 2007 INFORMATION FLUENCY - 1. Draws on prior experience and knowledge. ELA 1,3 2. Identifies the big picture. ELA 1, Info. Lit 1 3. Participates

More information

ARTS IN MOTION CHARTER SCHOOL 7th Grade ELA CURRICULUM MAP

ARTS IN MOTION CHARTER SCHOOL 7th Grade ELA CURRICULUM MAP ARTS IN MOTION CHARTER SCHOOL 7th Grade ELA CURRICULUM MAP Projects Essential Questions Enduring Understandings Cognitive Skills CCSS Final Product Cultural Narratives Project Why is it important to tell

More information

Linking Assessments: Concept and History

Linking Assessments: Concept and History Linking Assessments: Concept and History Michael J. Kolen, University of Iowa In this article, the history of linking is summarized, and current linking frameworks that have been proposed are considered.

More information

Warren Consolidated Schools

Warren Consolidated Schools Creating Dynamic Futures through Student Achievement, High Expectations, and Strong Relationships 1.888.4WCS.KIDS www.wcskids.net Text WCSKIDS to 5778 ADMINISTRATION BUILDING 313 Anita, MI 4893 586.825.24

More information

School Annual Education Report (AER) Cover Letter

School Annual Education Report (AER) Cover Letter School (AER) Cover Letter May 11, 2018 Dear Parents and Community Members: We are pleased to present you with the (AER) which provides key information on the 2016-2017 educational progress for the Lynch

More information

Test-Taking Strategies and Task-based Assessment: The Case of Iranian EFL Learners

Test-Taking Strategies and Task-based Assessment: The Case of Iranian EFL Learners Test-Taking Strategies and Task-based Assessment: The Case of Iranian EFL Learners Hossein Barati Department of English, Faculty of Foreign Languages, University of Isfahan barati@yahoo.com Zohreh Kashkoul*

More information

2017 VERSION 1 TECHNICAL BULLETIN. Pre. Pre.

2017 VERSION 1 TECHNICAL BULLETIN. Pre. Pre. 2017 VERSION 1 TECHNICAL BULLETIN Pre Pre www.act.org Contents Contents............................................................ i Tables...............................................................

More information

COMPETENCY REVIEW GUIDE OFFICE OF EDUCATOR LICENSURE. How to Satisfy and Document Subject Matter Knowledge Competency Review Requirements

COMPETENCY REVIEW GUIDE OFFICE OF EDUCATOR LICENSURE. How to Satisfy and Document Subject Matter Knowledge Competency Review Requirements COMPETENCY REVIEW GUIDE OFFICE OF EDUCATOR LICENSURE How to Satisfy and Document Subject Matter Knowledge Competency Review Requirements February, 2017 Table of Contents: INTRODUCTION 1 HOW TO SATISFY

More information

College of Education and Human Services Exceptional Student & Deaf Education Course Descriptions

College of Education and Human Services Exceptional Student & Deaf Education Course Descriptions CATALOG 2010-2011 Undergraduate Information College of Education and Human Services Exceptional Student & Deaf Education Course Descriptions ASL2140: American Sign Language I 4 This course in American

More information

State Education Agency Accessibility and Accommodations Policies:

State Education Agency Accessibility and Accommodations Policies: State Education Agency Accessibility and Policies: 2017-2018 Within the ACCESS for ELLs 2.0 Accessibility and Supplement, there are a number of places where the text refers to the development of specific

More information

TExES Deaf and Hard-of-Hearing (181) Test at a Glance

TExES Deaf and Hard-of-Hearing (181) Test at a Glance TExES Deaf and Hard-of-Hearing (181) Test at a Glance See the test preparation manual for complete information about the test along with sample questions, study tips and preparation resources. Test Name

More information

California Subject Examinations for Teachers

California Subject Examinations for Teachers California Subject Examinations for Teachers TEST GUIDE AMERICAN SIGN LANGUAGE SUBTEST III Subtest Description This document contains the World Languages: American Sign Language (ASL) subject matter requirements

More information

FOURTH EDITION. NorthStar ALIGNMENT WITH THE GLOBAL SCALE OF ENGLISH AND THE COMMON EUROPEAN FRAMEWORK OF REFERENCE

FOURTH EDITION. NorthStar ALIGNMENT WITH THE GLOBAL SCALE OF ENGLISH AND THE COMMON EUROPEAN FRAMEWORK OF REFERENCE 4 FOURTH EDITION NorthStar ALIGNMENT WITH THE GLOBAL SCALE OF ENGLISH AND THE COMMON EUROPEAN FRAMEWORK OF REFERENCE 1 NorthStar Reading & Writing 3, 4th Edition NorthStar FOURTH EDITION NorthStar, Fourth

More information

The Classification Accuracy of Measurement Decision Theory. Lawrence Rudner University of Maryland

The Classification Accuracy of Measurement Decision Theory. Lawrence Rudner University of Maryland Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, April 23-25, 2003 The Classification Accuracy of Measurement Decision Theory Lawrence Rudner University

More information

May 15, Dear Parents and Community Members:

May 15, Dear Parents and Community Members: May 15, 2018 Dear Parents and Community Members: We are pleased to present you with the (AER) which provides key information on the 2017-18 educational progress for the Zemmer Campus 8/9 Building. The

More information

Detection of Differential Test Functioning (DTF) and Differential Item Functioning (DIF) in MCCQE Part II Using Logistic Models

Detection of Differential Test Functioning (DTF) and Differential Item Functioning (DIF) in MCCQE Part II Using Logistic Models Detection of Differential Test Functioning (DTF) and Differential Item Functioning (DIF) in MCCQE Part II Using Logistic Models Jin Gong University of Iowa June, 2012 1 Background The Medical Council of

More information

Critical Thinking Assessment at MCC. How are we doing?

Critical Thinking Assessment at MCC. How are we doing? Critical Thinking Assessment at MCC How are we doing? Prepared by Maura McCool, M.S. Office of Research, Evaluation and Assessment Metropolitan Community Colleges Fall 2003 1 General Education Assessment

More information

SPECIAL EDUCATION (SED) DeGarmo Hall, (309) Website:Education.IllinoisState.edu Chairperson: Stacey R. Jones Bock.

SPECIAL EDUCATION (SED) DeGarmo Hall, (309) Website:Education.IllinoisState.edu Chairperson: Stacey R. Jones Bock. 368 SPECIAL EDUCATION (SED) 591 533 DeGarmo Hall, (309) 438-8980 Website:Education.IllinoisState.edu Chairperson: Stacey R. Jones Bock. General Department Information Program Admission Requirements for

More information

Re-Examining the Role of Individual Differences in Educational Assessment

Re-Examining the Role of Individual Differences in Educational Assessment Re-Examining the Role of Individual Differences in Educational Assesent Rebecca Kopriva David Wiley Phoebe Winter University of Maryland College Park Paper presented at the Annual Conference of the National

More information

I. Language and Communication Needs

I. Language and Communication Needs Child s Name Date Additional local program information The primary purpose of the Early Intervention Communication Plan is to promote discussion among all members of the Individualized Family Service Plan

More information

Cross-validation of easycbm Reading Cut Scores in Washington:

Cross-validation of easycbm Reading Cut Scores in Washington: Technical Report # 1109 Cross-validation of easycbm Reading Cut Scores in Washington: 2009-2010 P. Shawn Irvin Bitnara Jasmine Park Daniel Anderson Julie Alonzo Gerald Tindal University of Oregon Published

More information

Arts and Entertainment. Ecology. Technology. History and Deaf Culture

Arts and Entertainment. Ecology. Technology. History and Deaf Culture American Sign Language Level 3 (novice-high to intermediate-low) Course Description ASL Level 3 furthers the study of grammar, vocabulary, idioms, multiple meaning words, finger spelling, and classifiers

More information

Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University

Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure Rob Cavanagh Len Sparrow Curtin University R.Cavanagh@curtin.edu.au Abstract The study sought to measure mathematics anxiety

More information

The Vine Assessment System by LifeCubby

The Vine Assessment System by LifeCubby The Vine Assessment System by LifeCubby A Fully Integrated Platform for Observation, Daily Reporting, Communications and Assessment For Early Childhood Professionals and the Families that they Serve Alignment

More information

Detecting Suspect Examinees: An Application of Differential Person Functioning Analysis. Russell W. Smith Susan L. Davis-Becker

Detecting Suspect Examinees: An Application of Differential Person Functioning Analysis. Russell W. Smith Susan L. Davis-Becker Detecting Suspect Examinees: An Application of Differential Person Functioning Analysis Russell W. Smith Susan L. Davis-Becker Alpine Testing Solutions Paper presented at the annual conference of the National

More information

Assistant Superintendent of Business &

Assistant Superintendent of Business & THE LAMPHERE SCHOOLS ADMINISTRATION CENTER 3121 Dorchester Madison Heights, Michigan 4871-199 Telephone: (248) 589-199 FAX: (248) 589-2618 DALE STEEN Superintendent Finance PATRICK DILLON Assistant Superintendent

More information

Reliability and Validity of a Task-based Writing Performance Assessment for Japanese Learners of English

Reliability and Validity of a Task-based Writing Performance Assessment for Japanese Learners of English Reliability and Validity of a Task-based Writing Performance Assessment for Japanese Learners of English Yoshihito SUGITA Yamanashi Prefectural University Abstract This article examines the main data of

More information

Providing Highly-Valued Service Through Leadership, Innovation, and Collaboration. July 30, Dear Parents and Community Members:

Providing Highly-Valued Service Through Leadership, Innovation, and Collaboration. July 30, Dear Parents and Community Members: Providing Highly-Valued Service Through Leadership, Innovation, and Collaboration July 30, 2011 Dear Parents and Community Members: We are pleased to present you with the Annual Education Report (AER)

More information

A Comparison of Traditional and IRT based Item Quality Criteria

A Comparison of Traditional and IRT based Item Quality Criteria A Comparison of Traditional and IRT based Item Quality Criteria Brian D. Bontempo, Ph.D. Mountain ment, Inc. Jerry Gorham, Ph.D. Pearson VUE April 7, 2006 A paper presented at the Annual Meeting of the

More information

Description of components in tailored testing

Description of components in tailored testing Behavior Research Methods & Instrumentation 1977. Vol. 9 (2).153-157 Description of components in tailored testing WAYNE M. PATIENCE University ofmissouri, Columbia, Missouri 65201 The major purpose of

More information

Department of Middle, Secondary, Reading, and Deaf Education

Department of Middle, Secondary, Reading, and Deaf Education Department of Middle, Secondary, Reading, and Deaf Education 1 Department of Middle, Secondary, Reading, and Deaf Education Dr. Barbara Radcliffe, Department Head Room 1045, Education Building The Department

More information

To Thine Own Self Be True: A Five-Study Meta-Analysis on the Accuracy of Language-Learner Self-Assessment

To Thine Own Self Be True: A Five-Study Meta-Analysis on the Accuracy of Language-Learner Self-Assessment To Thine Own Self Be True: A Five-Study Meta-Analysis on the Accuracy of Language-Learner Self-Assessment Troy L. Cox, PhD Associate Director of Research and Assessment Center for Language Studies Brigham

More information

Bruno D. Zumbo, Ph.D. University of Northern British Columbia

Bruno D. Zumbo, Ph.D. University of Northern British Columbia Bruno Zumbo 1 The Effect of DIF and Impact on Classical Test Statistics: Undetected DIF and Impact, and the Reliability and Interpretability of Scores from a Language Proficiency Test Bruno D. Zumbo, Ph.D.

More information

linking in educational measurement: Taking differential motivation into account 1

linking in educational measurement: Taking differential motivation into account 1 Selecting a data collection design for linking in educational measurement: Taking differential motivation into account 1 Abstract In educational measurement, multiple test forms are often constructed to

More information

Purpose and Objectives of Study Objective 1 Objective 2 Objective 3 Participants and Settings Intervention Description Social peer networks

Purpose and Objectives of Study Objective 1 Objective 2 Objective 3 Participants and Settings Intervention Description Social peer networks 1 Title: Autism Peer Networks Project: Improving Social-Communication and Literacy for Young Children with ASD Funding: Institute of Education Sciences R324A090091 Session Presenters: Debra Kamps and Rose

More information

COLLEGE OF THE DESERT

COLLEGE OF THE DESERT COLLEGE OF THE DESERT Course Code ASL-001 Course Outline of Record 1. Course Code: ASL-001 2. a. Long Course Title: Elementary American Sign Language I b. Short Course Title: ELEMENTARY ASL I 3. a. Catalog

More information

Alignment in Educational Testing: What it is, What it isn t, and Why it is Important

Alignment in Educational Testing: What it is, What it isn t, and Why it is Important Alignment in Educational Testing: What it is, What it isn t, and Why it is Important Stephen G. Sireci University of Massachusetts Amherst Presentation delivered at the Connecticut Assessment Forum Rocky

More information

Characteristics of the Text Genre Nonfi ction Text Structure Three to eight lines of text in the same position on each page

Characteristics of the Text Genre Nonfi ction Text Structure Three to eight lines of text in the same position on each page LESSON 14 TEACHER S GUIDE by Karen J. Rothbardt Fountas-Pinnell Level J Nonfiction Selection Summary Children with special needs use a variety of special tools to help them see and hear. This simply written

More information

World Languages American Sign Language (ASL) Subject Matter Requirements

World Languages American Sign Language (ASL) Subject Matter Requirements World Languages American Sign Language (ASL) Subject Matter Requirements Part I: Content Domains for Subject Matter Understanding and Skill in World Languages American Sign Language (ASL) Domain 1. General

More information

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality

More information

Job Description: Special Education Teacher of Deaf and Hard of Hearing

Job Description: Special Education Teacher of Deaf and Hard of Hearing **FOR SCHOOL YEAR 2015-16** Reports to: Headmaster / Special Education Job Description: Special Education Teacher of Deaf and Hard of Hearing The Boston Arts Academy is looking for a high Teacher of Deaf

More information

MACOMB MONTESSORI ACADEMY School Annual Education Report (AER) Cover Letter - REVISED

MACOMB MONTESSORI ACADEMY School Annual Education Report (AER) Cover Letter - REVISED MACOMB MONTESSORI ACADEMY School (AER) Cover Letter - REVISED February 22, 217 Dear Parents and Community Members: We are pleased to present you with the (AER) which provides key information on the 215-216

More information

SEMINAR ON SERVICE MARKETING

SEMINAR ON SERVICE MARKETING SEMINAR ON SERVICE MARKETING Tracy Mary - Nancy LOGO John O. Summers Indiana University Guidelines for Conducting Research and Publishing in Marketing: From Conceptualization through the Review Process

More information

Guidelines for Captioning

Guidelines for Captioning Guidelines for Captioning TEXT: CASE Mixed case characters are preferred for readability Use Capital Letters for: Individual word Single Phrase to denote emphasis Shouting FONT USE: White Characters Medium

More information

Allen Independent School District Bundled LOTE Curriculum Beginning 2017 School Year ASL III

Allen Independent School District Bundled LOTE Curriculum Beginning 2017 School Year ASL III Allen Independent School District Bundled LOTE Curriculum Beginning 2017 School Year ASL III Page 1 of 19 Revised: 8/1/2017 114.36. American Sign Language, Level III (One Credit), Adopted 2014. (a) General

More information

NON-NEGOTIBLE EVALUATION CRITERIA

NON-NEGOTIBLE EVALUATION CRITERIA PUBLISHER: SUBJECT: COURSE: COPYRIGHT: SE ISBN: SPECIFIC GRADE: TITLE TE ISBN: NON-NEGOTIBLE EVALUATION CRITERIA 2017-2023 Group V World Language American Sign Language Level II Grades 7-12 Equity, Accessibility

More information

Scaling TOWES and Linking to IALS

Scaling TOWES and Linking to IALS Scaling TOWES and Linking to IALS Kentaro Yamamoto and Irwin Kirsch March, 2002 In 2000, the Organization for Economic Cooperation and Development (OECD) along with Statistics Canada released Literacy

More information

22932 Woodward Ave., Ferndale, MI School Annual Education Report (AER) Cover Letter

22932 Woodward Ave., Ferndale, MI School Annual Education Report (AER) Cover Letter 35 John R., Detroit, MI 4821 22932 Woodward Ave., Ferndale, MI 4822 313.831.351 School (AER) 248.582.81 Cover Letter School (AER) Cover Letter March 9, 217 Dear Parents and Community Members: We are pleased

More information

Investigating the Reliability of Classroom Observation Protocols: The Case of PLATO. M. Ken Cor Stanford University School of Education.

Investigating the Reliability of Classroom Observation Protocols: The Case of PLATO. M. Ken Cor Stanford University School of Education. The Reliability of PLATO Running Head: THE RELIABILTY OF PLATO Investigating the Reliability of Classroom Observation Protocols: The Case of PLATO M. Ken Cor Stanford University School of Education April,

More information

New Mexico TEAM Professional Development Module: Deaf-blindness

New Mexico TEAM Professional Development Module: Deaf-blindness [Slide 1] Welcome Welcome to the New Mexico TEAM technical assistance module on making eligibility determinations under the category of deaf-blindness. This module will review the guidance of the NM TEAM

More information

A circumstance or event that precedes a behavior. Uneasiness of the mind, typically shown by apprehension, worry and fear.

A circumstance or event that precedes a behavior. Uneasiness of the mind, typically shown by apprehension, worry and fear. Glossary of Terms for ADHD Accommodations Making changes to school curriculum in order to better serve children with special needs or learning differences. Accommodations can include a variety of modifications

More information

WE RE HERE TO HELP ESTAMOS AQUÍ PARA AYUDARLE ACCORDING TO THE 2010 UNITED STATES CENSUS, approximately 25.1 million people (or 8% of the total U.S. population) are limited English proficient (LEP). Limited

More information

Academic Program / Discipline Area (for General Education) or Co-Curricular Program Area:

Academic Program / Discipline Area (for General Education) or Co-Curricular Program Area: PROGRAM LEARNING OUTCOME ASSESSMENT PLAN General Information Academic Year of Implementation: 2012 2013 Academic Program / Discipline Area (for General Education) or Co-Curricular Program Area: Pre-major

More information

Smarter Balanced Interim Assessment Blocks Total Number of Items and hand scoring Requirements by Grade and Subject.

Smarter Balanced Interim Assessment Blocks Total Number of Items and hand scoring Requirements by Grade and Subject. Smarter Balanced Interim Assessment Blocks of Items and hand scoring Requirements by Grade and Subject. The following tables are intended to assist coordinators, site coordinators, and test administrators

More information

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form INVESTIGATING FIT WITH THE RASCH MODEL Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form of multidimensionality. The settings in which measurement

More information

Basic concepts and principles of classical test theory

Basic concepts and principles of classical test theory Basic concepts and principles of classical test theory Jan-Eric Gustafsson What is measurement? Assignment of numbers to aspects of individuals according to some rule. The aspect which is measured must

More information

School Annual Education Report (AER) Cover Letter

School Annual Education Report (AER) Cover Letter Lincoln Elementary Sam Skeels, Principal 158 S. Scott St Adrian, MI 49221 Phone: 517-265-8544 School (AER) Cover Letter April 29, 2017 Dear Parents and Community Members: We are pleased to present you

More information

BOARD CERTIFICATION PROCESS (EXCERPTS FOR SENIOR TRACK III) Stage I: Application and eligibility for candidacy

BOARD CERTIFICATION PROCESS (EXCERPTS FOR SENIOR TRACK III) Stage I: Application and eligibility for candidacy BOARD CERTIFICATION PROCESS (EXCERPTS FOR SENIOR TRACK III) All candidates for board certification in CFP must meet general eligibility requirements set by ABPP. Once approved by ABPP, candidates can choose

More information

ANN ARBOR PUBLIC SCHOOLS 2555 S. State Street Ann Arbor, MI www. a2schools.org Pioneer High School Annual Education Report (AER)!

ANN ARBOR PUBLIC SCHOOLS 2555 S. State Street Ann Arbor, MI www. a2schools.org Pioneer High School Annual Education Report (AER)! ANN ARBOR PUBLIC SCHOOLS 2555 S. State Street Ann Arbor, MI 48104 734-994-2200 www. a2schools.org Pioneer High School Annual Education Report (AER)!! Dear Pioneer Parents and Community Members: We are

More information

2015 Exam Committees Report for the National Physical Therapy Examination Program

2015 Exam Committees Report for the National Physical Therapy Examination Program 2015 Exam Committees Report for the National Physical Therapy Examination Program Executive Summary This report provides a summary of the National Physical Therapy Examination (NPTE) related activities

More information

Accessibility and Lecture Capture. David J. Blezard Michael S. McIntire Academic Technology

Accessibility and Lecture Capture. David J. Blezard Michael S. McIntire Academic Technology Accessibility and Lecture Capture David J. Blezard Michael S. McIntire Academic Technology IANAL WANL WANADACO What do they have in common? California Community Colleges California State University Fullerton

More information

ISA 540, Auditing Accounting Estimates, Including Fair Value Accounting Estimates, and Related Disclosures Issues and Task Force Recommendations

ISA 540, Auditing Accounting Estimates, Including Fair Value Accounting Estimates, and Related Disclosures Issues and Task Force Recommendations Agenda Item 1-A ISA 540, Auditing Accounting Estimates, Including Fair Value Accounting Estimates, and Related Disclosures Issues and Task Force Recommendations Introduction 1. Since the September 2016

More information

Georgia Performance Standards Framework for Biology 9-12

Georgia Performance Standards Framework for Biology 9-12 The following instructional plan is part of a GaDOE collection of Unit Frameworks, Performance Tasks, examples of Student Work, and Teacher Commentary. Many more GaDOE approved instructional plans are

More information

DISTRICT LETTERHEAD. REVISED TEMPLATE (Letter Sent on District s Letterhead) School Annual Education Report (AER) Cover Letter

DISTRICT LETTERHEAD. REVISED TEMPLATE (Letter Sent on District s Letterhead) School Annual Education Report (AER) Cover Letter DISTRICT LETTERHEAD REVISED 2017-18 TEMPLATE (Letter Sent on s Letterhead) School (AER) Cover Letter May 20, 2018 Dear Parents and Community Members: We are pleased to present you with the (AER) which

More information

JOHN C. THORNE, PHD, CCC-SLP UNIVERSITY OF WASHINGTON DEPARTMENT OF SPEECH & HEARING SCIENCE FETAL ALCOHOL SYNDROME DIAGNOSTIC AND PREVENTION NETWORK

JOHN C. THORNE, PHD, CCC-SLP UNIVERSITY OF WASHINGTON DEPARTMENT OF SPEECH & HEARING SCIENCE FETAL ALCOHOL SYNDROME DIAGNOSTIC AND PREVENTION NETWORK OPTIMIZING A CLINICAL LANGUAGE MEASURE FOR USE IN IDENTIFYING SIGNIFICANT NEURODEVELOPMENTAL IMPAIRMENT IN DIAGNOSIS OF FETAL ALCOHOL SPECTRUM DISORDERS (FASD) JOHN C. THORNE, PHD, CCC-SLP UNIVERSITY OF

More information

OFFICE OF EDUCATOR LICENSURE SUBJECT MATTER KNOWLEDGE. The Competency Review Made Simple

OFFICE OF EDUCATOR LICENSURE SUBJECT MATTER KNOWLEDGE. The Competency Review Made Simple OFFICE OF EDUCATOR LICENSURE SUBJECT MATTER KNOWLEDGE The Competency Review Made Simple Meeting and Verifying Subject Matter Competency Requirements January, 2016 Massachusetts Department of Elementary

More information

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups GMAC Scaling Item Difficulty Estimates from Nonequivalent Groups Fanmin Guo, Lawrence Rudner, and Eileen Talento-Miller GMAC Research Reports RR-09-03 April 3, 2009 Abstract By placing item statistics

More information