Technical Specifications
In order to provide summary information across a set of exercises, all tests must employ some form of scoring model. The most familiar of these scoring models is the one typically used in classroom testing: summing the number of correct responses to exercises in order to yield a total score. In some cases, these total scores are converted to percentage correct and assigned a letter grade. In other cases, a single cut point is selected to identify learners who have met some pre-selected criterion (e.g., 80 percent). This traditional scoring model assumes that each correct response is equally important and, therefore, is given equal weight in arriving at the total score. This assumption results in an additive model, in which the total score emphasizes the relative amount of knowledge and skill demonstrated, with no attention paid to the pattern of correct and incorrect responses. The additive model extends from the less formal classroom tests to more formal standardized tests. Standardized tests frequently employ this additive model through the use of stanine and grade-equivalent scores. These standardized scores are determined through the use of particular norming populations and represent the transformation of a set of raw scores. Recent advances in computer technology and the development of efficient algorithms have made the application of more powerful psychometric theories accessible to a wider range of users. Item Response Theory (IRT) is the model used in building and managing both the ETS PDQ Profile Series and the ETS Health Activities Literacy tests. Unlike traditional additive models, the IRT model allows us to estimate the difficulty level of a particular exercise relative to the difficulty of all other exercises in the assessments, as well as to estimate an individual's proficiency level in the assessed domain. The IRT model is also able to take the pattern of right and wrong responses into account.
More specifically, item response theory (IRT) is a mathematical model for the probability that a particular person will respond correctly to a particular item from a domain of items. The particular IRT model employed in these tests is the two-parameter logistic model. This probability is given as a function of a parameter characterizing the proficiency of that person, and two parameters characterizing the properties of that item: difficulty and discrimination. One of the strengths of IRT models is that when their assumptions hold and estimates of the model's item parameters are available for the collections of items that make up the different test forms, all results can be reported directly in terms of the IRT proficiency. This property of IRT scaling removes the need to establish the comparability of number-correct score scales for different forms of the test. A pool of tasks over which performance is modeled and the accompanying proficiency variable are referred to as a "scale." Analyses within a scale are generally carried out in two steps: first, the parameters of the tasks are estimated; and second, the proficiency levels of individuals or groups are estimated using the item parameters estimated earlier. A unidimensional IRT model such as the two-parameter logistic model assumes that performance on all of the tasks in a particular domain can, for the most part, be accounted for by a single underlying proficiency variable. Since tasks are placed along an IRT scale on the basis of their characteristics, it is possible to address questions concerning the nature of these tasks and their position
on the scale relative to one another. For example, questions such as the following come to mind: Why do tasks fall at different levels along a given scale? Do tasks that cluster around a given scale point reflect similar interactions between materials and processing demands? Can tasks at different levels be distinguished in terms of these variables? Answers to questions such as these lead to a better understanding of the nature of the knowledge and skills underlying successful performance at various proficiency levels within the domain assessed. In addition to focusing attention on the distribution of tasks along a scale, the IRT model used in these tests also estimates the proficiency levels of individuals based on their patterns of right and wrong responses to the set of exercises that comprise the scale. This is unlike the traditional additive scoring model, which simply ranks individuals by their total score independently of which items they answered correctly or incorrectly.

Establishing the PDQ literacy scales: linkage to the other surveys

Prose, document, and quantitative literacy proficiencies are reported on scales that have been established across several national and international adult literacy surveys. Together, these surveys of literacy represent more than 800 items and over 145,000 respondents. The surveys include the Young Adult Literacy Survey, the Department of Labor Survey of Workplace Literacy, the National Adult Literacy Survey, the International Adult Literacy Survey, and the International Adult Literacy and Lifeskills Survey. These survey data are all matrix sampled; i.e., each respondent received only a subset of items out of a larger item pool according to a particular administration design. This sampling design enables us to have response data on large numbers of items while minimizing the response burden at the individual respondent level.
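A matrix-sampled design of this kind can be sketched in a few lines. The pool size, block size, and item IDs below are hypothetical, and the adjacent-block pairing is only one illustrative way to obtain the overlap that lets IRT place all items on a common scale:

```python
def build_booklets(item_pool, block_size):
    """Split an item pool into consecutive blocks, then pair adjacent
    blocks so every booklet shares one block with its neighbours.
    This overlap is what allows linking all items onto one scale."""
    blocks = [item_pool[i:i + block_size]
              for i in range(0, len(item_pool), block_size)]
    # Each booklet = two adjacent blocks; wrap around so the design is balanced.
    return [blocks[i] + blocks[(i + 1) % len(blocks)]
            for i in range(len(blocks))]

pool = [f"item{i:03d}" for i in range(1, 13)]   # 12 items, hypothetical IDs
booklets = build_booklets(pool, block_size=4)

# Every respondent sees only 8 of the 12 items, and adjacent booklets share 4.
assert all(len(b) == 8 for b in booklets)
assert len(set(booklets[0]) & set(booklets[1])) == 4
```

Each respondent answers one booklet, so response data accumulate on every item in the pool without any individual taking all of them.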
The psychometric measurement model is used to stitch all responses together to construct a single scale representation. The use of a psychometric model is critical since none of the surveys included all 800 items. While only a subset of items was used in any one survey, each survey contained a substantial number of items identical to those in at least one other survey, thus providing the necessary overlap to link the items. This overlap of items across multiple surveys enables us to establish and maintain common literacy scales across the various surveys. Items making up each of the literacy scales provide a pool from which to select subsets of items having appropriate characteristics for measuring an individual's literacy proficiency.

Establishing the Health Activities Literacy Scale (HALS)

The Health Activities Literacy Scale is organized around five general categories of health-related activities: Health Promotion: enhancing and maintaining one's own health, including activities related to nutrition and exercise; Health Protection: safeguarding the health of individuals and communities, including health-related social and environmental issues.
Disease Prevention: taking preventive measures (e.g., immunizations) and engaging in early detection such as screening programs; Health Care and Maintenance: seeking care and forming a partnership with health providers; Systems Navigation: accessing needed services and understanding rights. All of the items used in previous literacy assessments were reviewed by three researchers to select those that were judged to measure health-related activities. Those researchers then independently coded the 191 selected tasks into one of the five health-related activities. All differences were resolved through discussion and refinement of the coding criteria. The distribution of these tasks by type of health activity is shown in the following table.

Health Activities and Number of Coded Items
Health Activities            Number of Items (n=191)
Health Promotion             60
Health Protection            65
Disease Prevention           18
Health Care and Maintenance  16
Systems Navigation           32

The 191 health-related literacy tasks that were identified provided a link across the existing literacy surveys and were used to create a new Health Activities Literacy Scale (HALS). Using IRT methodology, new item parameters were estimated for each of these tasks based on the responses of a nationally representative sample. The surveys from which the 191 health-related literacy tasks were selected represent different populations having various demographic characteristics. Current methodologies provide researchers with the tools needed to evaluate the performance of people even when they have been administered somewhat different tasks and when they represent different samples and populations studied over time. These methodologies have been used with student surveys such as the National Assessment of Educational Progress (NAEP) and the Programme for International Student Assessment (PISA), as well as the adult literacy surveys mentioned previously.
Therefore, even though the populations studied varied somewhat across the different surveys, the subsets of literacy tasks and the scoring rubrics that were common across the surveys were kept constant and their item parameters checked for their stability across each of the surveys. Over the years, the same item parameters have been found to fit very well to each of the subpopulations within a country as well as across countries with different languages. Once the health-related literacy tasks had been scaled, the stability of the new item parameters was verified across each of the surveys to ensure
they fit well. More than 58,000 respondents from across the various adult surveys were used to estimate and verify the item parameters for the health-related tasks. Because the focus of the current study is the U.S. population, only data from the U.S. were used. The model used for scaling the health literacy items from the NALS data is the two-parameter logistic (2PL) model from item response theory. The stability of the item parameters was checked across the various survey populations to ensure the comparability of the data and the stability of the newly established scale. The common item parameters were reviewed to ensure that they fit well in order to justify the use of the new item parameters and to establish the stability of the new HALS. Five different approaches were used to evaluate the stability of the item parameters, including: a graphical method, which allows us to observe the item characteristic curves for various populations; three statistical indices, which estimate the fit of each item for each population against the common item parameters (the χ² statistic, the Root Mean Squared Deviation statistic, and the weighted Mean Deviation); and the impact of the item parameters on the overall proficiency estimate of a particular population. Deviations are based on the difference between model-based expected proportions correct and observed proportions correct at 41 equally spaced ability scale values. The fit of the health-related literacy tasks was remarkably good by any conventional standard and, therefore, a single set of common item parameters could be used to describe all survey samples. HALS is a new scale. Even though it is based on pre-existing items from existing literacy surveys, the properties of this new scale had not been previously defined. That is, the scale could range from 0 to 100, from 200 to 800, or within some other preselected range.
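The deviation-based fit check described above (model-based expected versus observed proportions correct at 41 equally spaced ability values) can be sketched as follows; the item parameters and the ability range are illustrative assumptions, not the operational values:

```python
import math

def p_2pl(theta, a, b, D=1.7):
    """Two-parameter logistic item response function."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def rmsd_fit(observed, a, b, theta_lo=-4.0, theta_hi=4.0, n_points=41):
    """Root mean squared deviation between observed proportions correct
    and model-based expected proportions at equally spaced theta values."""
    step = (theta_hi - theta_lo) / (n_points - 1)
    thetas = [theta_lo + i * step for i in range(n_points)]
    devs = [(obs - p_2pl(t, a, b)) ** 2 for t, obs in zip(thetas, observed)]
    return math.sqrt(sum(devs) / n_points)

# Sanity check: observed proportions generated from the model itself
# should give an RMSD of (almost) zero; a flat .50 line should not.
a, b = 1.0, 0.0
perfect = [p_2pl(-4.0 + i * 0.2, a, b) for i in range(41)]
assert rmsd_fit(perfect, a, b) < 1e-9
assert rmsd_fit([0.5] * 41, a, b) > 0.1
```

In practice the observed proportions would come from each survey population separately, so a large RMSD flags an item whose behavior differs across populations.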
The procedure to align the health literacy activities scale with the NALS scales was based on matching two moments of the proficiency distributions: the mean and standard deviation. In this study, the provisional proficiency distribution based on the health scale was matched to the distribution of means of three NALS scale proficiency values (m= and sd=65.380). This allowed us to apply a linear transformation that defines the HALS on a scale ranging from 0 to 500 and having the same mean and standard deviation as the three NALS proficiency scales. One of the benefits of the HALS lies in the fact that it uses items from existing large-scale surveys of adults. Several researchers reviewed each literacy task to determine how well it fit into the five health activities described in this report. This adds content relevance to the scale because each item was judged to be representative of a type of health activity, thus focusing the measurement on tasks that broadly define health literacy rather than general literacy. Each of the 191 items that make up the HALS had been administered to nationally representative samples of adults. Because a large number of adults responded to each item, we were able to check how well each item behaves psychometrically. For example, each item was checked for differential performance by selected subgroups. In addition, each item was checked to determine how well it fits onto the overall scale. Other pieces of information relating to the validity of the HALS stem from our understanding of the construct and of what contributes to the difficulty of each item and its position along the health scale. The NALS database links the HALS to an extensive set of background information. This link also contributes to the validation of
the HALS. Using this information, we are able to see the correlations between the HALS and a wide range of background characteristics that include age, gender, race/ethnicity, and level of education.

Design of testlets and item selection

Both the PDQ Profile Series and the Health Activities Literacy Test include locator tests that employ adaptive testing. Standard adaptive testing is carried out at the item level. The general idea of adaptive testing is to administer the item that provides the most information with regard to the respondent's proficiency. The measurement model determines which criteria will be used. While there are many benefits to using an adaptive test design, there are some issues that need to be considered. For example, item-level adaptive testing is not necessarily ideal under all circumstances due to its higher overhead costs. The most serious concern regarding the use of item-level adaptation for a test like the PDQ Profile Series or the Health Activities Literacy Test is that multiple questions are based on a single stimulus material. Thus, item-level adaptation is not very feasible for an individual. However, it is possible to have what might be called a quasi-adaptive test that is made up of testlets. A testlet is formed from several stimulus/item sets. Selecting and administering a testlet to those within the optimal range of proficiency can yield most of the efficiency gain one might expect from a fully adaptive test while addressing the other limiting factors. The locator adaptive test for both the PDQ and HALS consists of four phases: 1) a small set of background information, 2) stage 1 cognitive items, 3) stage 2 cognitive items, and 4) reporting of results. The background information collected in phase 1 provides an initial estimate of where someone's proficiency may be before the administration of any cognitive items.
This initial estimate is based on the relationships between background information and literacy proficiency seen in the large-scale assessment data. Incorporating this background information into the overall design makes the administration of cognitive items (Stage 1) more efficient. The cognitive items administered in Stage 2 are selected based on the results from both the background information and the Stage 1 testing. However, the results are calculated and reported based only on the responses to the cognitive items administered in Stages 1 and 2. Background information is not used in the computation of results that are reported. A primary difference between the locator test and the full-length test is the level of precision we obtain about where a person is on a particular literacy scale. For the locator test, the goal is simply to determine whether each person is in Level 1, Level 2, or Level 3 and higher. With the full-length test, our goal is to actually estimate proficiency on each literacy scale. This requires more cognitive test information; therefore, we add another stage of testing. As in the locator test, each individual is given a set of background questions. This is followed first by a set of cognitive items (Stage 1) and then by a second set of cognitive items selected based on the responses to the first set. The full-length test has an additional set of cognitive items (Stage 3). This testlet is longer than the two earlier testlets since the purpose of the full-length test is to estimate the proficiency of each respondent as accurately as possible. Results are then calculated and reported based only on the cognitive items taken in Stages 1 to 3.
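The routing idea in the multistage design above can be sketched as a simple rule that blends the background-based prior with Stage 1 performance. The blending weights, cut-offs, and testlet labels below are purely illustrative, not the operational routing rules:

```python
def choose_stage2_testlet(background_estimate, stage1_correct, n_stage1):
    """Route a respondent to an easy/medium/hard Stage 2 testlet.
    `background_estimate` is a prior proficiency indicator on a 0-1 scale
    derived from background questions (hypothetical); the blend weights
    and cut-offs are illustrative assumptions."""
    stage1_rate = stage1_correct / n_stage1
    # Weight the observed Stage 1 performance more than the prior.
    blended = 0.3 * background_estimate + 0.7 * stage1_rate
    if blended < 0.4:
        return "easy"
    elif blended < 0.7:
        return "medium"
    return "hard"

# A weak Stage 1 showing routes downward; a strong one routes upward.
assert choose_stage2_testlet(0.5, 2, 8) == "easy"
assert choose_stage2_testlet(0.5, 7, 8) == "hard"
```

The operational tests select whole testlets this way rather than single items, which preserves the stimulus/item sets while capturing most of the efficiency of fully adaptive testing.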
Test Design

Test         Background  Stage 1  Stage 2  Stage 3  Reporting
Locator      X           X        X                 X
Full-Length  X           X        X        X        X

The scaling model

The scaling model used for the PDQ Profile Series and the Health Activities Literacy tests is the two-parameter logistic (2PL) model from Item Response Theory (Birnbaum, 1968; Lord, 1980). It is a mathematical model for the probability that a particular person will respond correctly to a particular item from a single domain of items. This probability is given as a function of a parameter characterizing the proficiency of that person, and two parameters characterizing the properties of that item. The following 2PL IRT model was employed in the IALS:

P_i(θ_j) = P(x_ij = 1 | θ_j, a_i, b_i) = 1 / (1 + exp(-D a_i (θ_j - b_i)))

where x_ij is the response of person j to item i, 1 if correct and 0 if incorrect; θ_j is the proficiency of person j (note that a person with higher proficiency has a greater probability of responding correctly); D is a normalizing constant set to 1.7; a_i is the slope parameter of item i, characterizing its sensitivity to proficiency; and b_i is its location parameter, characterizing its difficulty. Note that this is a monotone increasing function with respect to θ; that is, the conditional probability of a correct response increases as the value of θ increases. In addition, a linear indeterminacy exists with respect to the values of θ_j, a_i, and b_i for a scale defined under the two-parameter model. In other words, an arbitrary linear transformation of θ, say θ* = Mθ + X, together with the corresponding transformations a*_i = a_i / M and b*_i = M b_i + X, gives:

P(x_ij = 1 | θ*_j, a*_i, b*_i) = P(x_ij = 1 | θ_j, a_i, b_i)

The transformation constants for the PDQ provisional scale were set in 1994 as (51.67, ) for Prose, (52.46, ) for Document, and (54.41, ) for Quantitative. These constants were set so that the total proficiency distributions of the Young Adult
Literacy Survey have the intended mean and standard deviation. All subsequent surveys used these transformations. Another main assumption of IRT is conditional independence. In other words, item response probabilities depend only on θ (a measure of proficiency) and the specified item parameters, and not on any demographic characteristics of respondents, on any other items presented together in a test, or on the survey administration conditions. This extends to multiple languages and multiple assessments over time and enables us to place all items on one scale, even though these items appeared in multiple surveys administered to multiple populations over time. The assumption was monitored for every survey sample using χ² statistics, the square root of the weighted Mean Squared Deviation, and the weighted Mean Deviation. This conditional independence enables us to formulate the following joint probability of a particular response pattern x across a set of n items:

P(x | θ, a, b) = Π_{i=1..n} P_i(θ)^{x_i} (1 - P_i(θ))^{1 - x_i}

Replacing the hypothetical response pattern with the real scored data, the above function can be viewed as a likelihood function that is to be maximized with a given set of item parameters. These item parameters were treated as known for the subsequent analyses. Another assumption of the model is unidimensionality; that is, performance on a set of items is accounted for by a single unidimensional variable. Although this assumption may be too strong, the use of the model is motivated by the need to summarize overall performance parsimoniously within a single domain. Hence, item parameters were estimated for each scale separately. Testing the assumptions of the IRT model, especially the assumption of conditional independence, is a critical part of the data analyses. Conditional independence means that respondents with identical abilities have the same probability of producing a correct response on an item regardless of their group membership.
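The 2PL response probability, its linear indeterminacy, and the conditional-independence likelihood can be illustrated with a short sketch. The item parameters and the transformation constants (M, X) below are hypothetical, not the operational values:

```python
import math

def p_2pl(theta, a, b, D=1.7):
    """P(x=1 | theta, a, b): two-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def pattern_likelihood(theta, responses, items):
    """Joint probability of a scored response pattern under conditional
    independence: product over items of P^x * (1-P)^(1-x)."""
    like = 1.0
    for x, (a, b) in zip(responses, items):
        p = p_2pl(theta, a, b)
        like *= p if x == 1 else (1.0 - p)
    return like

# Hypothetical three-item test: (slope a, location b) pairs.
items = [(1.2, -1.0), (0.8, 0.0), (1.5, 1.0)]

# Linear indeterminacy: rescaling theta, a, b together leaves P unchanged.
M, X = 65.0, 250.0
theta = 0.5
for a, b in items:
    assert abs(p_2pl(theta, a, b)
               - p_2pl(M * theta + X, a / M, M * b + X)) < 1e-12

# An all-correct pattern is more likely at a high theta than a low one.
assert pattern_likelihood(2.0, [1, 1, 1], items) > \
       pattern_likelihood(-2.0, [1, 1, 1], items)
```

The invariance check is why a scale can be reported in any convenient metric (such as 0 to 500): the transformation constants change the numbers, not the probabilities.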
Serious violation of the conditional independence assumption would undermine the accuracy and integrity of the results. It is common practice to expect a portion of the items to be found unsuitable for a particular subpopulation. Thus, while the item parameters were being estimated, empirical conditional percentages correct were monitored across the samples.

Estimation of Proficiency

As described earlier, the PDQ scales were based on data from large-scale surveys. The purpose of a large-scale survey is to focus on the proficiency distributions of subpopulations rather than on the proficiencies of individuals. However, the primary interest of PDQ is to estimate the proficiency of respondents. The proficiency estimation method chosen for PDQ is the expectation of the likelihood function, which is approximated by numerical integration of the following function:

f(θ | x_j, a, b)
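A rough sketch of this expectation-based (EAP) approach integrates θ against the likelihood times a prior on a quadrature grid. The grid, the standard-normal prior, and the item parameters below are illustrative assumptions, not the operational procedure:

```python
import math

def p_2pl(theta, a, b, D=1.7):
    """Two-parameter logistic item response function."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def eap_estimate(responses, items, n_points=81, lo=-4.0, hi=4.0):
    """Expected a posteriori proficiency: numerically integrate theta
    against likelihood * prior over an equally spaced grid.
    An (unnormalized) N(0,1) prior is assumed here for illustration."""
    step = (hi - lo) / (n_points - 1)
    num = den = 0.0
    for i in range(n_points):
        theta = lo + i * step
        weight = math.exp(-0.5 * theta * theta)     # N(0,1) prior kernel
        for x, (a, b) in zip(responses, items):
            p = p_2pl(theta, a, b)
            weight *= p if x == 1 else (1.0 - p)
        num += theta * weight
        den += weight
    return num / den

items = [(1.2, -1.0), (0.8, 0.0), (1.5, 1.0)]   # hypothetical parameters

# All-correct should yield a higher estimate than all-incorrect.
assert eap_estimate([1, 1, 1], items) > eap_estimate([0, 0, 0], items)
```

Unlike maximum likelihood, this expectation remains well defined for perfect and zero score patterns, which is consistent with the stability advantage noted in the text.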
It is worth noting that this function was found to be more accurate than maximum likelihood estimation, since values at the extremes of the ability distribution are less stable.

Reliability of proficiency scores

The reliability of a test can be defined as the degree of consistency between two measures of the same thing. In classical test theory, this is operationally defined as a measure of the degree of true-score variation relative to observed-score variation. An analogous extension to the IRT model would be the ratio of the population variance minus measurement error to the overall population variance. One difficulty with this notion is that the measurement error in an IRT model is proficiency dependent and varies over the range of proficiency. In addition, the literacy tests described here are adaptive, meaning there are many combinations of items depending upon the proficiency of a given respondent. In order to characterize the consistency of measurement, an index of the expectation of the measurement errors over proficiency was calculated and averaged across the combinations of items used for the full-length tests. This resulted in the following estimates of reliability:

Prose: .925
Document: .882
Quantitative: .883
Health Activities Literacy: .935

In contrast to the full-length test, the purpose of the locator test is to classify the proficiency of respondents into one of three levels, based on a 0 to 500 point scale: whether they are performing below 225 (Level 1), between 225 and 275 (Level 2), or above 275 (Level 3 and above), as accurately as possible. The accuracy of classification depends on the distance of each respondent's proficiency from one of the cut points. For example, if the proficiency is 50 points away from the cut point, the accuracy is nearly perfect (.98) for all three scales. However, if the proficiency were just about on either side of a cut point, then the accuracy would be just about .50.
The statistics below indicate the average accuracy of classifying respondents into these levels for each of the scales. These estimates are based on the average accuracy calculated at various distances in each of the levels.

Literacy scale   Locator accuracy
Prose            .880
Document         .881
Quantitative     .882
Health           .931
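The relationship between distance from a cut point and classification accuracy can be sketched with a normal approximation to the measurement error. The SEM value below is an illustrative assumption chosen so that a 50-point distance reproduces roughly the .98 figure in the text; it is not a reported statistic:

```python
import math

def classification_accuracy(theta, cut, sem):
    """Probability that a respondent with true proficiency `theta` is
    classified on the correct side of `cut`, assuming normally
    distributed measurement error with standard error `sem`."""
    z = abs(theta - cut) / sem
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

SEM = 24.0  # assumed standard error of measurement on the 0-500 scale

# 50 points from the 225 cut point: accuracy close to .98, as in the text.
assert classification_accuracy(275.0, 225.0, SEM) > 0.97
# Right at the cut point: accuracy falls to .50.
assert abs(classification_accuracy(225.0, 225.0, SEM) - 0.5) < 1e-9
```

Averaging this quantity over the proficiency distribution within each level gives overall classification accuracies of the kind reported in the table above.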
Validity of scores

Validity refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores. The validity of the PDQ Profile Series and the Health Activities Literacy Test comes from the various efforts that were undertaken in the development and conduct of the large-scale national and international assessments. Collectively, these efforts contribute to aspects of construct and criterion-related validity. These activities include: a definition of literacy that was drafted by a national panel of experts and adopted by countries for use in the international assessments; the development of a framework that operationalized this definition, drawing on recent theories of reading and literacy; the development of literacy tasks that used everyday materials selected by panels of experts and reviewed both nationally and internationally; the administration of all items to large representative samples of adults covering a wide range of social and economic backgrounds, with checks for bias; the conduct of item attribute studies to determine the kinds of knowledge and skills that are associated with successful performance; and the linkage of each literacy scale to extensive background characteristics so that it is possible to examine the connection between these characteristics and performance on the various literacy scales. The definition and framework used to construct and understand the literacy scales form the body of information that supports the construct validity of these measures. See, for example, Kirsch, I., The International Adult Literacy Survey: Defining What Was Measured. Princeton, NJ: Educational Testing Service. RR. The strong connections that are observed between these various literacy scales and the set of background characteristics support the criterion-related validity of the measures. This type of information is printed in the reports and publications stemming from the large-scale assessments.
A selected set of tables representing these relationships is provided here. The idea is that if the PDQ and HALS tests are measuring aspects of literacy that are important to participation in various aspects of society, such as education and labor, then respondents' performance on the PDQ and HALS should relate to these and other variables that are collected as part of the survey. All four scales are shown here to have strong positive relationships with respondents' highest level of education.

Average Literacy Proficiency by Education Level
Education level          Prose  Document  Quantitative  Health
0-8 yrs
yrs
GED/High School
Some College
yr College degree
4yr College degree
Graduate study/degree

The relationship between respondents' age and proficiency is weaker and also curvilinear. The average proficiency is highest among those who are between 35 and 44 years of age. It should be noted that document proficiency seems to have the strongest relationship with age, as evidenced by the sharp decline of proficiency after 45.

Average Literacy Proficiency by Age
Age              Prose  Document  Quantitative  Health
16 to
to
to
to
to
and older

It was expected that country of birth would be strongly related to English literacy skills, since it is likely to reflect the language one first learns to speak and read. As seen below, the average difference between those who were born in the US and those who were born in another country is about 60 to 70 points.

Average Literacy Proficiency by Country of Birth
Country of Birth          Prose  Document  Quantitative  Health
Born in the USA
Born in another country

Examples of reading practices and literacy skills are also shown below. It is clear that those with higher literacy skills read newspapers far more often than others. There is a very strong indication that respondents who do not read newspapers have very low literacy skills.

Average Literacy Proficiency by Newspaper Reading Practices
Newspaper Reading Practices  Prose  Document  Quantitative  Health
Every day
A few times a week
Once a week
Less than once a week
Never

The table below shows how literacy proficiency relates to employment status. Those respondents who are employed full time show much higher literacy skills than those who are unemployed or those who are out of the labor force.

Average Proficiency by Labor Force Status
Labor Force Status   Prose  Document  Quantitative  Health
Full-time employed
Part-time employed
Unemployed
Out of labor force

In addition to employment status, the type of occupation has a strong relationship with literacy proficiency, as is shown in the table below. The respondents who are employed as professionals or managers have much higher literacy skills than those who work as laborers.

Average Proficiency by Most Recent Occupation
Most Recent Occupation    Prose  Document  Quantitative  Health
Professional/Managers
Sales
Craft
Laborer
JOURNAL OF EDUCATIONAL MEASUREMENT VOL. II, NO, 2 FALL 1974 THE NATURE OF OBJECTIVITY WITH THE RASCH MODEL SUSAN E. WHITELY' AND RENE V. DAWIS 2 University of Minnesota Although it has been claimed that
More informationReliability, validity, and all that jazz
Reliability, validity, and all that jazz Dylan Wiliam King s College London Introduction No measuring instrument is perfect. The most obvious problems relate to reliability. If we use a thermometer to
More informationChapter 2 Norms and Basic Statistics for Testing MULTIPLE CHOICE
Chapter 2 Norms and Basic Statistics for Testing MULTIPLE CHOICE 1. When you assert that it is improbable that the mean intelligence test score of a particular group is 100, you are using. a. descriptive
More informationJSM Survey Research Methods Section
Methods and Issues in Trimming Extreme Weights in Sample Surveys Frank Potter and Yuhong Zheng Mathematica Policy Research, P.O. Box 393, Princeton, NJ 08543 Abstract In survey sampling practice, unequal
More informationDescription of components in tailored testing
Behavior Research Methods & Instrumentation 1977. Vol. 9 (2).153-157 Description of components in tailored testing WAYNE M. PATIENCE University ofmissouri, Columbia, Missouri 65201 The major purpose of
More informationINVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form
INVESTIGATING FIT WITH THE RASCH MODEL Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form of multidimensionality. The settings in which measurement
More informationBruno D. Zumbo, Ph.D. University of Northern British Columbia
Bruno Zumbo 1 The Effect of DIF and Impact on Classical Test Statistics: Undetected DIF and Impact, and the Reliability and Interpretability of Scores from a Language Proficiency Test Bruno D. Zumbo, Ph.D.
More informationMeasuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University
Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure Rob Cavanagh Len Sparrow Curtin University R.Cavanagh@curtin.edu.au Abstract The study sought to measure mathematics anxiety
More informationAPPLYING THE RASCH MODEL TO PSYCHO-SOCIAL MEASUREMENT A PRACTICAL APPROACH
APPLYING THE RASCH MODEL TO PSYCHO-SOCIAL MEASUREMENT A PRACTICAL APPROACH Margaret Wu & Ray Adams Documents supplied on behalf of the authors by Educational Measurement Solutions TABLE OF CONTENT CHAPTER
More information11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES
Correlational Research Correlational Designs Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are
More informationChapter 2--Norms and Basic Statistics for Testing
Chapter 2--Norms and Basic Statistics for Testing Student: 1. Statistical procedures that summarize and describe a series of observations are called A. inferential statistics. B. descriptive statistics.
More informationRunning head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note
Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1 Nested Factor Analytic Model Comparison as a Means to Detect Aberrant Response Patterns John M. Clark III Pearson Author Note John M. Clark III,
More informationApplications. DSC 410/510 Multivariate Statistical Methods. Discriminating Two Groups. What is Discriminant Analysis
DSC 4/5 Multivariate Statistical Methods Applications DSC 4/5 Multivariate Statistical Methods Discriminant Analysis Identify the group to which an object or case (e.g. person, firm, product) belongs:
More informationBuilding Evaluation Scales for NLP using Item Response Theory
Building Evaluation Scales for NLP using Item Response Theory John Lalor CICS, UMass Amherst Joint work with Hao Wu (BC) and Hong Yu (UMMS) Motivation Evaluation metrics for NLP have been mostly unchanged
More informationSamantha Sample 01 Feb 2013 EXPERT STANDARD REPORT ABILITY ADAPT-G ADAPTIVE GENERAL REASONING TEST. Psychometrics Ltd.
01 Feb 2013 EXPERT STANDARD REPORT ADAPTIVE GENERAL REASONING TEST ABILITY ADAPT-G REPORT STRUCTURE The Standard Report presents s results in the following sections: 1. Guide to Using This Report Introduction
More informationSection 5. Field Test Analyses
Section 5. Field Test Analyses Following the receipt of the final scored file from Measurement Incorporated (MI), the field test analyses were completed. The analysis of the field test data can be broken
More informationLec 02: Estimation & Hypothesis Testing in Animal Ecology
Lec 02: Estimation & Hypothesis Testing in Animal Ecology Parameter Estimation from Samples Samples We typically observe systems incompletely, i.e., we sample according to a designed protocol. We then
More informationBasic concepts and principles of classical test theory
Basic concepts and principles of classical test theory Jan-Eric Gustafsson What is measurement? Assignment of numbers to aspects of individuals according to some rule. The aspect which is measured must
More informationExamining the Psychometric Properties of The McQuaig Occupational Test
Examining the Psychometric Properties of The McQuaig Occupational Test Prepared for: The McQuaig Institute of Executive Development Ltd., Toronto, Canada Prepared by: Henryk Krajewski, Ph.D., Senior Consultant,
More informationCHAPTER 3 RESEARCH METHODOLOGY
CHAPTER 3 RESEARCH METHODOLOGY 3.1 Introduction 3.1 Methodology 3.1.1 Research Design 3.1. Research Framework Design 3.1.3 Research Instrument 3.1.4 Validity of Questionnaire 3.1.5 Statistical Measurement
More informationChapter 3 Tools for Practical Theorizing: Theoretical Maps and Ecosystem Maps
Chapter 3 Tools for Practical Theorizing: Theoretical Maps and Ecosystem Maps Chapter Outline I. Introduction A. Understanding theoretical languages requires universal translators 1. Theoretical maps identify
More informationCARE Cross-project Collectives Analysis: Technical Appendix
CARE Cross-project Collectives Analysis: Technical Appendix The CARE approach to development support and women s economic empowerment is often based on the building blocks of collectives. Some of these
More informationUvA-DARE (Digital Academic Repository)
UvA-DARE (Digital Academic Repository) Standaarden voor kerndoelen basisonderwijs : de ontwikkeling van standaarden voor kerndoelen basisonderwijs op basis van resultaten uit peilingsonderzoek van der
More informationBehavioral Intervention Rating Rubric. Group Design
Behavioral Intervention Rating Rubric Group Design Participants Do the students in the study exhibit intensive social, emotional, or behavioral challenges? % of participants currently exhibiting intensive
More informationUnit 1 Exploring and Understanding Data
Unit 1 Exploring and Understanding Data Area Principle Bar Chart Boxplot Conditional Distribution Dotplot Empirical Rule Five Number Summary Frequency Distribution Frequency Polygon Histogram Interquartile
More informationDesign and Analysis Plan Quantitative Synthesis of Federally-Funded Teen Pregnancy Prevention Programs HHS Contract #HHSP I 5/2/2016
Design and Analysis Plan Quantitative Synthesis of Federally-Funded Teen Pregnancy Prevention Programs HHS Contract #HHSP233201500069I 5/2/2016 Overview The goal of the meta-analysis is to assess the effects
More informationTest Validity. What is validity? Types of validity IOP 301-T. Content validity. Content-description Criterion-description Construct-identification
What is? IOP 301-T Test Validity It is the accuracy of the measure in reflecting the concept it is supposed to measure. In simple English, the of a test concerns what the test measures and how well it
More informationInvestigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories
Kamla-Raj 010 Int J Edu Sci, (): 107-113 (010) Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories O.O. Adedoyin Department of Educational Foundations,
More informationComparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria
Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Thakur Karkee Measurement Incorporated Dong-In Kim CTB/McGraw-Hill Kevin Fatica CTB/McGraw-Hill
More informationGMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups
GMAC Scaling Item Difficulty Estimates from Nonequivalent Groups Fanmin Guo, Lawrence Rudner, and Eileen Talento-Miller GMAC Research Reports RR-09-03 April 3, 2009 Abstract By placing item statistics
More informationInvestigating the Reliability of Classroom Observation Protocols: The Case of PLATO. M. Ken Cor Stanford University School of Education.
The Reliability of PLATO Running Head: THE RELIABILTY OF PLATO Investigating the Reliability of Classroom Observation Protocols: The Case of PLATO M. Ken Cor Stanford University School of Education April,
More informationMBA SEMESTER III. MB0050 Research Methodology- 4 Credits. (Book ID: B1206 ) Assignment Set- 1 (60 Marks)
MBA SEMESTER III MB0050 Research Methodology- 4 Credits (Book ID: B1206 ) Assignment Set- 1 (60 Marks) Note: Each question carries 10 Marks. Answer all the questions Q1. a. Differentiate between nominal,
More informationMarc J. Tassé, PhD Nisonger Center UCEDD
FINALLY... AN ADAPTIVE BEHAVIOR SCALE FOCUSED ON PROVIDING PRECISION AT THE DIAGNOSTIC CUT-OFF. How Item Response Theory Contributed to the Development of the DABS Marc J. Tassé, PhD UCEDD The Ohio State
More informationBIOSTATISTICAL METHODS AND RESEARCH DESIGNS. Xihong Lin Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
BIOSTATISTICAL METHODS AND RESEARCH DESIGNS Xihong Lin Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA Keywords: Case-control study, Cohort study, Cross-Sectional Study, Generalized
More informationLikelihood Ratio Based Computerized Classification Testing. Nathan A. Thompson. Assessment Systems Corporation & University of Cincinnati.
Likelihood Ratio Based Computerized Classification Testing Nathan A. Thompson Assessment Systems Corporation & University of Cincinnati Shungwon Ro Kenexa Abstract An efficient method for making decisions
More informationThe Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing
The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing Terry A. Ackerman University of Illinois This study investigated the effect of using multidimensional items in
More informationalternate-form reliability The degree to which two or more versions of the same test correlate with one another. In clinical studies in which a given function is going to be tested more than once over
More informationTest review. Comprehensive Trail Making Test (CTMT) By Cecil R. Reynolds. Austin, Texas: PRO-ED, Inc., Test description
Archives of Clinical Neuropsychology 19 (2004) 703 708 Test review Comprehensive Trail Making Test (CTMT) By Cecil R. Reynolds. Austin, Texas: PRO-ED, Inc., 2002 1. Test description The Trail Making Test
More informationHierarchical Bayesian Modeling of Individual Differences in Texture Discrimination
Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination Timothy N. Rubin (trubin@uci.edu) Michael D. Lee (mdlee@uci.edu) Charles F. Chubb (cchubb@uci.edu) Department of Cognitive
More informationGender-Based Differential Item Performance in English Usage Items
A C T Research Report Series 89-6 Gender-Based Differential Item Performance in English Usage Items Catherine J. Welch Allen E. Doolittle August 1989 For additional copies write: ACT Research Report Series
More informationDetection Theory: Sensitivity and Response Bias
Detection Theory: Sensitivity and Response Bias Lewis O. Harvey, Jr. Department of Psychology University of Colorado Boulder, Colorado The Brain (Observable) Stimulus System (Observable) Response System
More informationAssessing the Validity and Reliability of the Teacher Keys Effectiveness. System (TKES) and the Leader Keys Effectiveness System (LKES)
Assessing the Validity and Reliability of the Teacher Keys Effectiveness System (TKES) and the Leader Keys Effectiveness System (LKES) of the Georgia Department of Education Submitted by The Georgia Center
More informationRAG Rating Indicator Values
Technical Guide RAG Rating Indicator Values Introduction This document sets out Public Health England s standard approach to the use of RAG ratings for indicator values in relation to comparator or benchmark
More informationAnalysis of Confidence Rating Pilot Data: Executive Summary for the UKCAT Board
Analysis of Confidence Rating Pilot Data: Executive Summary for the UKCAT Board Paul Tiffin & Lewis Paton University of York Background Self-confidence may be the best non-cognitive predictor of future
More informationImportance of Good Measurement
Importance of Good Measurement Technical Adequacy of Assessments: Validity and Reliability Dr. K. A. Korb University of Jos The conclusions in a study are only as good as the data that is collected. The
More informationEvaluation Models STUDIES OF DIAGNOSTIC EFFICIENCY
2. Evaluation Model 2 Evaluation Models To understand the strengths and weaknesses of evaluation, one must keep in mind its fundamental purpose: to inform those who make decisions. The inferences drawn
More informationChapter-2 RESEARCH DESIGN
Chapter-2 RESEARCH DESIGN 33 2.1 Introduction to Research Methodology: The general meaning of research is the search for knowledge. Research is also defined as a careful investigation or inquiry, especially
More informationMantel-Haenszel Procedures for Detecting Differential Item Functioning
A Comparison of Logistic Regression and Mantel-Haenszel Procedures for Detecting Differential Item Functioning H. Jane Rogers, Teachers College, Columbia University Hariharan Swaminathan, University of
More informationShiken: JALT Testing & Evaluation SIG Newsletter. 12 (2). April 2008 (p )
Rasch Measurementt iin Language Educattiion Partt 2:: Measurementt Scalles and Invariiance by James Sick, Ed.D. (J. F. Oberlin University, Tokyo) Part 1 of this series presented an overview of Rasch measurement
More informationThe Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests
The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests Mary E. Lunz and Betty A. Bergstrom, American Society of Clinical Pathologists Benjamin D. Wright, University
More informationWrite your identification number on each paper and cover sheet (the number stated in the upper right hand corner on your exam cover).
STOCKHOLM UNIVERSITY Department of Economics Course name: Empirical methods 2 Course code: EC2402 Examiner: Per Pettersson-Lidbom Number of credits: 7,5 credits Date of exam: Sunday 21 February 2010 Examination
More informationMeasurement Invariance (MI): a general overview
Measurement Invariance (MI): a general overview Eric Duku Offord Centre for Child Studies 21 January 2015 Plan Background What is Measurement Invariance Methodology to test MI Challenges with post-hoc
More informationCritical Thinking Assessment at MCC. How are we doing?
Critical Thinking Assessment at MCC How are we doing? Prepared by Maura McCool, M.S. Office of Research, Evaluation and Assessment Metropolitan Community Colleges Fall 2003 1 General Education Assessment
More informationA Brief Introduction to Bayesian Statistics
A Brief Introduction to Statistics David Kaplan Department of Educational Psychology Methods for Social Policy Research and, Washington, DC 2017 1 / 37 The Reverend Thomas Bayes, 1701 1761 2 / 37 Pierre-Simon
More informationAnalysis and Interpretation of Data Part 1
Analysis and Interpretation of Data Part 1 DATA ANALYSIS: PRELIMINARY STEPS 1. Editing Field Edit Completeness Legibility Comprehensibility Consistency Uniformity Central Office Edit 2. Coding Specifying
More informationDevelopment, Standardization and Application of
American Journal of Educational Research, 2018, Vol. 6, No. 3, 238-257 Available online at http://pubs.sciepub.com/education/6/3/11 Science and Education Publishing DOI:10.12691/education-6-3-11 Development,
More informationPUBLIC KNOWLEDGE AND ATTITUDES SCALE CONSTRUCTION: DEVELOPMENT OF SHORT FORMS
PUBLIC KNOWLEDGE AND ATTITUDES SCALE CONSTRUCTION: DEVELOPMENT OF SHORT FORMS Prepared for: Robert K. Bell, Ph.D. National Science Foundation Division of Science Resources Studies 4201 Wilson Blvd. Arlington,
More informationEvaluating the quality of analytic ratings with Mokken scaling
Psychological Test and Assessment Modeling, Volume 57, 2015 (3), 423-444 Evaluating the quality of analytic ratings with Mokken scaling Stefanie A. Wind 1 Abstract Greatly influenced by the work of Rasch
More informationo^ &&cvi AL Perceptual and Motor Skills, 1965, 20, Southern Universities Press 1965
Ml 3 Hi o^ &&cvi AL 44755 Perceptual and Motor Skills, 1965, 20, 311-316. Southern Universities Press 1965 m CONFIDENCE RATINGS AND LEVEL OF PERFORMANCE ON A JUDGMENTAL TASK 1 RAYMOND S. NICKERSON AND
More informationSTATISTICS AND RESEARCH DESIGN
Statistics 1 STATISTICS AND RESEARCH DESIGN These are subjects that are frequently confused. Both subjects often evoke student anxiety and avoidance. To further complicate matters, both areas appear have
More informationSchool Annual Education Report (AER) Cover Letter
Lincoln Elementary Sam Skeels, Principal 158 S. Scott St Adrian, MI 49221 Phone: 517-265-8544 School (AER) Cover Letter April 29, 2017 Dear Parents and Community Members: We are pleased to present you
More informationConnexion of Item Response Theory to Decision Making in Chess. Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan
Connexion of Item Response Theory to Decision Making in Chess Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan Acknowledgement A few Slides have been taken from the following presentation
More informationCochrane Pregnancy and Childbirth Group Methodological Guidelines
Cochrane Pregnancy and Childbirth Group Methodological Guidelines [Prepared by Simon Gates: July 2009, updated July 2012] These guidelines are intended to aid quality and consistency across the reviews
More informationChapter 11. Experimental Design: One-Way Independent Samples Design
11-1 Chapter 11. Experimental Design: One-Way Independent Samples Design Advantages and Limitations Comparing Two Groups Comparing t Test to ANOVA Independent Samples t Test Independent Samples ANOVA Comparing
More informationCross-validation of easycbm Reading Cut Scores in Washington:
Technical Report # 1109 Cross-validation of easycbm Reading Cut Scores in Washington: 2009-2010 P. Shawn Irvin Bitnara Jasmine Park Daniel Anderson Julie Alonzo Gerald Tindal University of Oregon Published
More informationTHE USE OF MULTIVARIATE ANALYSIS IN DEVELOPMENT THEORY: A CRITIQUE OF THE APPROACH ADOPTED BY ADELMAN AND MORRIS A. C. RAYNER
THE USE OF MULTIVARIATE ANALYSIS IN DEVELOPMENT THEORY: A CRITIQUE OF THE APPROACH ADOPTED BY ADELMAN AND MORRIS A. C. RAYNER Introduction, 639. Factor analysis, 639. Discriminant analysis, 644. INTRODUCTION
More informationValidating Measures of Self Control via Rasch Measurement. Jonathan Hasford Department of Marketing, University of Kentucky
Validating Measures of Self Control via Rasch Measurement Jonathan Hasford Department of Marketing, University of Kentucky Kelly D. Bradley Department of Educational Policy Studies & Evaluation, University
More informationStatistics. Nur Hidayanto PSP English Education Dept. SStatistics/Nur Hidayanto PSP/PBI
Statistics Nur Hidayanto PSP English Education Dept. RESEARCH STATISTICS WHAT S THE RELATIONSHIP? RESEARCH RESEARCH positivistic Prepositivistic Postpositivistic Data Initial Observation (research Question)
More informationMultilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison
Group-Level Diagnosis 1 N.B. Please do not cite or distribute. Multilevel IRT for group-level diagnosis Chanho Park Daniel M. Bolt University of Wisconsin-Madison Paper presented at the annual meeting
More information1 Introduction. st0020. The Stata Journal (2002) 2, Number 3, pp
The Stata Journal (22) 2, Number 3, pp. 28 289 Comparative assessment of three common algorithms for estimating the variance of the area under the nonparametric receiver operating characteristic curve
More informationlinking in educational measurement: Taking differential motivation into account 1
Selecting a data collection design for linking in educational measurement: Taking differential motivation into account 1 Abstract In educational measurement, multiple test forms are often constructed to
More informationEmpirical Formula for Creating Error Bars for the Method of Paired Comparison
Empirical Formula for Creating Error Bars for the Method of Paired Comparison Ethan D. Montag Rochester Institute of Technology Munsell Color Science Laboratory Chester F. Carlson Center for Imaging Science
More informationPERSONALITY ASSESSMENT PROFICIENCY: REPORT REVIEW FORM
PERSONALITY ASSESSMENT PROFICIENCY: REPORT REVIEW FORM Applicant Name: Reviewer Name: Date: I. Please consider each criteria item as either: Met proficiency criterion (Yes, circle 1 point) or Not met proficiency
More informationA Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests
A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests David Shin Pearson Educational Measurement May 007 rr0701 Using assessment and research to promote learning Pearson Educational
More informationTitle: A new statistical test for trends: establishing the properties of a test for repeated binomial observations on a set of items
Title: A new statistical test for trends: establishing the properties of a test for repeated binomial observations on a set of items Introduction Many studies of therapies with single subjects involve
More informationDRAFT (Final) Concept Paper On choosing appropriate estimands and defining sensitivity analyses in confirmatory clinical trials
DRAFT (Final) Concept Paper On choosing appropriate estimands and defining sensitivity analyses in confirmatory clinical trials EFSPI Comments Page General Priority (H/M/L) Comment The concept to develop
More informationCHAPTER - 6 STATISTICAL ANALYSIS. This chapter discusses inferential statistics, which use sample data to
CHAPTER - 6 STATISTICAL ANALYSIS 6.1 Introduction This chapter discusses inferential statistics, which use sample data to make decisions or inferences about population. Populations are group of interest
More informationSUPPLEMENTARY INFORMATION. Table 1 Patient characteristics Preoperative. language testing
Categorical Speech Representation in the Human Superior Temporal Gyrus Edward F. Chang, Jochem W. Rieger, Keith D. Johnson, Mitchel S. Berger, Nicholas M. Barbaro, Robert T. Knight SUPPLEMENTARY INFORMATION
More informationMulti-Specialty Recruitment Assessment Test Blueprint & Information
Multi-Specialty Recruitment Assessment Test Blueprint & Information 1. Structure of the Multi-Specialty Recruitment Assessment... 2 2. Professional Dilemmas Paper... 3 2.2. Context/Setting... 3 2.3. Target
More information