Technical Specifications
In order to provide summary information across a set of exercises, all tests must employ some form of scoring model. The most familiar of these scoring models is the one typically used in classroom testing: summing the number of correct responses to exercises in order to yield a total score. In some cases, these total scores are converted to percentage correct and assigned a letter grade. In other cases, a single cut point is selected to identify learners who have met some pre-selected criterion (e.g., 80 percent). This traditional scoring model assumes that each correct response is equally important and, therefore, is given equal weight in arriving at the total score. This assumption results in an additive model, in which the total score emphasizes the relative amount of knowledge and skill demonstrated, with no attention paid to the pattern of correct and incorrect responses. The additive model extends from the less formal classroom tests to more formal standardized tests. Standardized tests frequently employ this additive model through the use of stanine and grade-equivalent scores. These standardized scores are determined through the use of particular norming populations and represent the transformation of a set of raw scores. Recent advances in computer technology and the development of efficient algorithms have made the application of more powerful psychometric theories accessible to a wider range of users. Item Response Theory (IRT) is the model used in building and managing both the ETS PDQ Profile Series and the ETS Health Activities Literacy tests. Unlike traditional additive models, the IRT model allows us to estimate the difficulty level of a particular exercise relative to the difficulty of all other exercises in the assessments, as well as to estimate an individual's proficiency level in the assessed domain. The IRT model is also able to take the pattern of right and wrong responses into account.
More specifically, item response theory (IRT) is a mathematical model for the probability that a particular person will respond correctly to a particular item from a domain of items. The particular IRT model employed in these tests is the two-parameter logistic model. This probability is given as a function of a parameter characterizing the proficiency of that person, and two parameters characterizing the properties of that item: difficulty and discrimination. One of the strengths of IRT models is that when their assumptions hold and estimates of the model's item parameters are available for the collections of items that make up the different test forms, all results can be reported directly in terms of the IRT proficiency. This property of IRT scaling removes the need to establish the comparability of number-correct score scales for different forms of the test. A pool of tasks over which performance is modeled and the accompanying proficiency variable are referred to as a "scale." Analyses within a scale are generally carried out in two steps: first, the parameters of the tasks are estimated; and second, the proficiency levels of individuals or groups are estimated using the item parameters estimated earlier. A unidimensional IRT model such as the two-parameter logistic model assumes that performance on all of the tasks in a particular domain can, for the most part, be accounted for by a single underlying proficiency variable. Since tasks are placed along an IRT scale on the basis of their characteristics, it is possible to address questions concerning the nature of these tasks and their position
on the scale relative to one another. For example, questions such as the following come to mind: Why do tasks fall at different levels along a given scale? Do tasks that cluster around a given scale point reflect similar interactions between materials and processing demands? Can tasks at different levels be distinguished in terms of these variables? Answers to questions such as these lead to a better understanding of the nature of the knowledge and skills underlying successful performance at various proficiency levels within the domain assessed. In addition to focusing attention on the distribution of tasks along a scale, the IRT model used in these tests also estimates the proficiency levels of individuals based on their patterns of right and wrong responses to the set of exercises that comprise the scale. This is unlike the traditional additive scoring model, which simply ranks individuals by their total score independently of which items they answered correctly or incorrectly.

Establishing the PDQ literacy scales: linkage to the other surveys

Prose, document, and quantitative literacy proficiencies are reported on scales that have been established across several national and international adult literacy surveys. Together, these surveys of literacy represent more than 800 items and over 145,000 respondents. The surveys include the Young Adult Literacy Survey, the Department of Labor Survey of Workplace Literacy, the National Adult Literacy Survey, the International Adult Literacy Survey, and the International Adult Literacy and Lifeskills Survey. These survey data are all matrix sampled; i.e., each respondent received only a subset of items out of a larger item pool according to a particular administration design. This sampling design enables us to have response data on large numbers of items while minimizing the response burden at the individual respondent level.
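A matrix-sampled design of this kind can be sketched in a few lines. The pool size, block size, and item IDs below are hypothetical, and the adjacent-block pairing is only one illustrative way to obtain the overlap that lets IRT place all items on a common scale:

```python
def build_booklets(item_pool, block_size):
    """Split an item pool into consecutive blocks, then pair adjacent
    blocks so every booklet shares one block with its neighbours.
    This overlap is what allows linking all items onto one scale."""
    blocks = [item_pool[i:i + block_size]
              for i in range(0, len(item_pool), block_size)]
    # Each booklet = two adjacent blocks; wrap around so the design is balanced.
    return [blocks[i] + blocks[(i + 1) % len(blocks)]
            for i in range(len(blocks))]

pool = [f"item{i:03d}" for i in range(1, 13)]   # 12 items, hypothetical IDs
booklets = build_booklets(pool, block_size=4)

# Every respondent sees only 8 of the 12 items, and adjacent booklets share 4.
assert all(len(b) == 8 for b in booklets)
assert len(set(booklets[0]) & set(booklets[1])) == 4
```

Each respondent answers one booklet, so response data accumulate on every item in the pool without any individual taking all of them.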
The psychometric measurement model is used to stitch all responses together to construct a single scale representation. The use of a psychometric model is critical since none of the surveys included all 800 items. While only a subset of items was used in any one survey, each survey contained a substantial number of items identical to those in at least one other survey, thus providing the necessary overlap to link the items. This overlap of items across multiple surveys enables us to establish and maintain common literacy scales across the various surveys. Items making up each of the literacy scales provide a pool from which to select subsets of items having appropriate characteristics for measuring an individual's literacy proficiency.

Establishing the Health Activities Literacy Scale (HALS)

The Health Activities Literacy Scale is organized around five general categories of health-related activities: Health Promotion: enhancing and maintaining one's own health, including activities related to nutrition and exercise; Health Protection: safeguarding the health of individuals and communities, including health-related social and environmental issues.
Disease Prevention: taking preventive measures (e.g., immunizations) and engaging in early detection such as screening programs; Health Care and Maintenance: seeking care and forming a partnership with health providers; Systems Navigation: accessing needed services and understanding rights. All of the items used in previous literacy assessments were reviewed by three researchers to select those that were judged to measure health-related activities. Those researchers then independently coded the 191 selected tasks into one of the five health-related activities. All differences were resolved through discussion and refinement of the coding criteria. The distribution of these tasks by type of health activity is shown in the following table.

Health Activities and Number of Coded Items
Health Activities            Number of Items (n=191)
Health Promotion             60
Health Protection            65
Disease Prevention           18
Health Care and Maintenance  16
Systems Navigation           32

The 191 health-related literacy tasks that were identified provided a link across the existing literacy surveys and were used to create a new Health Activities Literacy Scale (HALS). Using IRT methodology, new item parameters were estimated for each of these tasks based on the responses of a nationally representative sample. The surveys from which the 191 health-related literacy tasks were selected represent different populations having various demographic characteristics. Current methodologies provide researchers with the tools needed to evaluate the performance of people even when they have been administered somewhat different tasks and when they represent different samples and populations studied over time. These methodologies have been used with student surveys such as the National Assessment of Educational Progress (NAEP) and the Programme for International Student Assessment (PISA), as well as the adult literacy surveys mentioned previously.
Therefore, even though the populations studied varied somewhat across the different surveys, the subsets of literacy tasks and the scoring rubrics that were common across the surveys were kept constant and their item parameters checked for their stability across each of the surveys. Over the years, the same item parameters have been found to fit very well to each of the subpopulations within a country as well as across countries with different languages. Once the health-related literacy tasks had been scaled, the stability of the new item parameters was verified across each of the surveys to ensure
they fit well. More than 58,000 respondents from across the various adult surveys were used to estimate and verify the item parameters for the health-related tasks. Because the focus of the current study is the U.S. population, only data from the U.S. were used. The model used for scaling the health literacy items from the NALS data is the two-parameter logistic (2PL) model from item response theory. The stability of the item parameters was checked across the various survey populations to ensure the comparability of the data and the stability of the newly established scale. The common item parameters were reviewed to ensure that they fit well in order to justify the use of the new item parameters and to establish the stability of the new HALS. Five different approaches were used to evaluate the stability of the item parameters, including: a graphical method, which allows us to observe the item characteristic curves for various populations; three statistical indices, which estimate the fit of each item for each population against the common item parameters (the χ² statistic, the Root Mean Squared Deviation statistic, and the weighted Mean Deviation); and the impact of the item parameters on the overall proficiency estimate of a particular population. Deviations are based on the difference between model-based expected proportions correct and observed proportions correct at 41 equally spaced ability scale values. The fit of the health-related literacy tasks was remarkably good by any conventional standard and, therefore, a single set of common item parameters could be used to describe all survey samples. HALS is a new scale. Even though it is based on pre-existing items from existing literacy surveys, the properties of this new scale had not been previously defined. That is, the scale could range from 0 to 100, from 200 to 800, or within some other preselected range.
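The deviation-based fit check described above (model-based expected versus observed proportions correct at 41 equally spaced ability values) can be sketched as follows; the item parameters and the ability range are illustrative assumptions, not the operational values:

```python
import math

def p_2pl(theta, a, b, D=1.7):
    """Two-parameter logistic item response function."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def rmsd_fit(observed, a, b, theta_lo=-4.0, theta_hi=4.0, n_points=41):
    """Root mean squared deviation between observed proportions correct
    and model-based expected proportions at equally spaced theta values."""
    step = (theta_hi - theta_lo) / (n_points - 1)
    thetas = [theta_lo + i * step for i in range(n_points)]
    devs = [(obs - p_2pl(t, a, b)) ** 2 for t, obs in zip(thetas, observed)]
    return math.sqrt(sum(devs) / n_points)

# Sanity check: observed proportions generated from the model itself
# should give an RMSD of (almost) zero; a flat .50 line should not.
a, b = 1.0, 0.0
perfect = [p_2pl(-4.0 + i * 0.2, a, b) for i in range(41)]
assert rmsd_fit(perfect, a, b) < 1e-9
assert rmsd_fit([0.5] * 41, a, b) > 0.1
```

In practice the observed proportions would come from each survey population separately, so a large RMSD flags an item whose behavior differs across populations.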
The procedure to align the health literacy activities scale with the NALS scales was based on matching two moments of the proficiency distributions: the mean and standard deviation. In this study, the provisional proficiency distribution based on the health scale was matched to the distribution of means of three NALS scale proficiency values (m= and sd=65.380). This allowed us to apply a linear transformation that defines the HALS on a scale ranging from 0 to 500 and having the same mean and standard deviation as the three NALS proficiency scales. One of the benefits of the HALS lies in the fact that it uses items from existing large-scale surveys of adults. Several researchers reviewed each literacy task to determine how well it fit into the five health activities described in this report. This adds content relevance to the scale because each item was judged to be representative of a type of health activity, thus focusing the measurement on tasks that broadly define health literacy rather than general literacy. Each of the 191 items that make up the HALS had been administered to nationally representative samples of adults. Because a large number of adults responded to each item, we were able to check how well each item behaves psychometrically. For example, each item was checked for differential performance by selected subgroups. In addition, each item was checked to determine how well it fits onto the overall scale. Other pieces of information relating to the validity of the HALS stem from our understanding of the construct and of what contributes to the difficulty of each item and its position along the health scale. The NALS database links the HALS to an extensive set of background information. This link also contributes to the validation of
the HALS. Using this information, we are able to see the correlations between the HALS and a wide range of background characteristics that include age, gender, race/ethnicity, and level of education.

Design of testlets and item selection

Both the PDQ Profile Series and the Health Activities Literacy Test include locator tests that employ adaptive testing. Standard adaptive testing is carried out at the item level. The general idea of adaptive testing is to administer the item that provides the most information with regard to the respondent's proficiency. The measurement model determines which criteria will be used. While there are many benefits to using an adaptive test design, there are some issues that need to be considered. For example, item-level adaptive testing is not necessarily ideal under all circumstances due to its higher overhead costs. The most serious concern regarding the use of item-level adaptation for a test like the PDQ Profile Series or the Health Activities Literacy Test is that multiple questions are based on a single stimulus material. Thus, item-level adaptation is not very feasible for an individual. However, it is possible to have what might be called a quasi-adaptive test that is made up of testlets. A testlet is formed from several stimulus/item sets. Selecting and administering a testlet to those within the optimal range of proficiency can yield most of the efficiency gain one might expect from a fully adaptive test while addressing the other limiting factors. The locator adaptive test for both the PDQ and HALS consists of four phases: 1) a small set of background information, 2) stage 1 cognitive items, 3) stage 2 cognitive items, and 4) reporting of results. The background information collected in phase 1 provides an initial estimate of where someone's proficiency may be before the administration of any cognitive items.
This initial estimate is based on the relationships between background information and literacy proficiency seen in the large-scale assessment data. Incorporating this background information into the overall design makes the administration of cognitive items (Stage 1) more efficient. The cognitive items administered in Stage 2 are selected based on the results from both the background information and the Stage 1 testing. However, the results are calculated and reported based only on the responses to the cognitive items administered in Stages 1 and 2. Background information is not used in the computation of results that are reported. A primary difference between the locator test and the full-length test is the level of precision we obtain about where a person is on a particular literacy scale. For the locator test, the goal is simply to determine whether each person is in Level 1, Level 2, or Level 3 and higher. With the full-length test, our goal is to actually estimate proficiency on each literacy scale. This requires more cognitive test information; therefore, we add another stage of testing. As in the locator test, each individual is given a set of background questions. This is followed first by a set of cognitive items (Stage 1) and then by a second set of cognitive items selected based on the responses to the first set. The full-length test has an additional set of cognitive items (Stage 3). This testlet is longer than the two earlier testlets since the purpose of the full-length test is to estimate the proficiency of each respondent as accurately as possible. Results are then calculated and reported based only on the cognitive items taken in Stages 1 to 3.
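The routing idea in the multistage design above can be sketched as a simple rule that blends the background-based prior with Stage 1 performance. The blending weights, cut-offs, and testlet labels below are purely illustrative, not the operational routing rules:

```python
def choose_stage2_testlet(background_estimate, stage1_correct, n_stage1):
    """Route a respondent to an easy/medium/hard Stage 2 testlet.
    `background_estimate` is a prior proficiency indicator on a 0-1 scale
    derived from background questions (hypothetical); the blend weights
    and cut-offs are illustrative assumptions."""
    stage1_rate = stage1_correct / n_stage1
    # Weight the observed Stage 1 performance more than the prior.
    blended = 0.3 * background_estimate + 0.7 * stage1_rate
    if blended < 0.4:
        return "easy"
    elif blended < 0.7:
        return "medium"
    return "hard"

# A weak Stage 1 showing routes downward; a strong one routes upward.
assert choose_stage2_testlet(0.5, 2, 8) == "easy"
assert choose_stage2_testlet(0.5, 7, 8) == "hard"
```

The operational tests select whole testlets this way rather than single items, which preserves the stimulus/item sets while capturing most of the efficiency of fully adaptive testing.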
Test Design

Test         Background  Stage 1  Stage 2  Stage 3  Reporting
Locator      X           X        X                 X
Full-Length  X           X        X        X        X

The scaling model

The scaling model used for the PDQ Profile Series and the Health Activities Literacy tests is the two-parameter logistic (2PL) model from Item Response Theory (Birnbaum, 1968; Lord, 1980). It is a mathematical model for the probability that a particular person will respond correctly to a particular item from a single domain of items. This probability is given as a function of a parameter characterizing the proficiency of that person, and two parameters characterizing the properties of that item. The following 2PL IRT model was employed in the IALS:

P_i(θ_j) = P(x_ij = 1 | θ_j, a_i, b_i) = 1 / (1 + exp(-D a_i (θ_j - b_i)))

where x_ij is the response of person j to item i, 1 if correct and 0 if incorrect; θ_j is the proficiency of person j (note that a person with higher proficiency has a greater probability of responding correctly); D is a normalizing constant set to 1.7; a_i is the slope parameter of item i, characterizing its sensitivity to proficiency; and b_i is its location parameter, characterizing its difficulty. Note that this is a monotone increasing function with respect to θ; that is, the conditional probability of a correct response increases as the value of θ increases. In addition, a linear indeterminacy exists with respect to the values of θ_j, a_i, and b_i for a scale defined under the two-parameter model. In other words, an arbitrary linear transformation of θ, say θ* = Mθ + X, together with the corresponding transformations a*_i = a_i / M and b*_i = M b_i + X, gives:

P(x_ij = 1 | θ*_j, a*_i, b*_i) = P(x_ij = 1 | θ_j, a_i, b_i)

The transformation constants for the PDQ provisional scale were set in 1994 as (51.67, ) for Prose, (52.46, ) for Document, and (54.41, ) for Quantitative. These constants were set so that the total proficiency distributions of the Young Adult
Literacy Survey have the intended mean and standard deviation. All subsequent surveys used these transformations. Another main assumption of IRT is conditional independence. In other words, item response probabilities depend only on θ (a measure of proficiency) and the specified item parameters, and not on any demographic characteristics of respondents, on any other items presented together in a test, or on the survey administration conditions. This extends to multiple languages and multiple assessments over time and enables us to place all items on one scale, even though these items appeared in multiple surveys administered to multiple populations over time. The assumption was monitored for every survey sample using χ² statistics, the square root of the weighted Mean Squared Deviation, and the weighted Mean Deviation. This conditional independence enables us to formulate the following joint probability of a particular response pattern x across a set of n items:

P(x | θ, a, b) = Π_{i=1..n} P_i(θ)^{x_i} (1 - P_i(θ))^{1 - x_i}

Replacing the hypothetical response pattern with the real scored data, the above function can be viewed as a likelihood function that is to be maximized with a given set of item parameters. These item parameters were treated as known for the subsequent analyses. Another assumption of the model is unidimensionality; that is, performance on a set of items is accounted for by a single unidimensional variable. Although this assumption may be too strong, the use of the model is motivated by the need to summarize overall performance parsimoniously within a single domain. Hence, item parameters were estimated for each scale separately. Testing the assumptions of the IRT model, especially the assumption of conditional independence, is a critical part of the data analyses. Conditional independence means that respondents with identical abilities have the same probability of producing a correct response on an item regardless of their group membership.
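The 2PL response probability, its linear indeterminacy, and the conditional-independence likelihood can be illustrated with a short sketch. The item parameters and the transformation constants (M, X) below are hypothetical, not the operational values:

```python
import math

def p_2pl(theta, a, b, D=1.7):
    """P(x=1 | theta, a, b): two-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def pattern_likelihood(theta, responses, items):
    """Joint probability of a scored response pattern under conditional
    independence: product over items of P^x * (1-P)^(1-x)."""
    like = 1.0
    for x, (a, b) in zip(responses, items):
        p = p_2pl(theta, a, b)
        like *= p if x == 1 else (1.0 - p)
    return like

# Hypothetical three-item test: (slope a, location b) pairs.
items = [(1.2, -1.0), (0.8, 0.0), (1.5, 1.0)]

# Linear indeterminacy: rescaling theta, a, b together leaves P unchanged.
M, X = 65.0, 250.0
theta = 0.5
for a, b in items:
    assert abs(p_2pl(theta, a, b)
               - p_2pl(M * theta + X, a / M, M * b + X)) < 1e-12

# An all-correct pattern is more likely at a high theta than a low one.
assert pattern_likelihood(2.0, [1, 1, 1], items) > \
       pattern_likelihood(-2.0, [1, 1, 1], items)
```

The invariance check is why a scale can be reported in any convenient metric (such as 0 to 500): the transformation constants change the numbers, not the probabilities.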
Serious violation of the conditional independence assumption would undermine the accuracy and integrity of the results. It is common practice to expect a portion of the items to be found unsuitable for a particular subpopulation. Thus, while the item parameters were being estimated, empirical conditional percentages correct were monitored across the samples.

Estimation of Proficiency

As described earlier, the PDQ scales were based on data from large-scale surveys. The purpose of a large-scale survey is to focus on the proficiency distributions of subpopulations rather than on the proficiencies of individuals. However, the primary interest of PDQ is to estimate the proficiency of respondents. The proficiency estimation method chosen for PDQ is the expectation of the likelihood function, which is approximated by numerical integration of the following function:

f(θ | x_j, a, b)
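A rough sketch of this expectation-based (EAP) approach integrates θ against the likelihood times a prior on a quadrature grid. The grid, the standard-normal prior, and the item parameters below are illustrative assumptions, not the operational procedure:

```python
import math

def p_2pl(theta, a, b, D=1.7):
    """Two-parameter logistic item response function."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def eap_estimate(responses, items, n_points=81, lo=-4.0, hi=4.0):
    """Expected a posteriori proficiency: numerically integrate theta
    against likelihood * prior over an equally spaced grid.
    An (unnormalized) N(0,1) prior is assumed here for illustration."""
    step = (hi - lo) / (n_points - 1)
    num = den = 0.0
    for i in range(n_points):
        theta = lo + i * step
        weight = math.exp(-0.5 * theta * theta)     # N(0,1) prior kernel
        for x, (a, b) in zip(responses, items):
            p = p_2pl(theta, a, b)
            weight *= p if x == 1 else (1.0 - p)
        num += theta * weight
        den += weight
    return num / den

items = [(1.2, -1.0), (0.8, 0.0), (1.5, 1.0)]   # hypothetical parameters

# All-correct should yield a higher estimate than all-incorrect.
assert eap_estimate([1, 1, 1], items) > eap_estimate([0, 0, 0], items)
```

Unlike maximum likelihood, this expectation remains well defined for perfect and zero score patterns, which is consistent with the stability advantage noted in the text.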
It is worth noting that this function was found to be more accurate than maximum likelihood estimation, since values at the extremes of the ability distribution are less stable.

Reliability of proficiency scores

The reliability of a test can be defined as the degree of consistency between two measures of the same thing. In classical test theory, this is operationally defined as a measure of the degree of true-score variation relative to observed-score variation. An analogous extension to the IRT model would be the ratio of the population variance minus measurement error to the overall population variance. One difficulty with this notion is that the measurement error in an IRT model is proficiency dependent and varies over the range of proficiency. In addition, the literacy tests described here are adaptive, meaning there are many combinations of items depending upon the proficiency of a given respondent. In order to characterize the consistency of measurement, an index of the expectation of the measurement errors over proficiency was calculated and averaged across the combinations of items used for the full-length tests. This resulted in the following estimates of reliability:

Prose: .925
Document: .882
Quantitative: .883
Health Activities Literacy: .935

In contrast to the full-length test, the purpose of the locator test is to classify the proficiency of respondents into one of three levels, based on a 0 to 500 point scale: whether they are performing below 225 (Level 1), between 225 and 275 (Level 2), or above 275 (Level 3 and above), as accurately as possible. The accuracy of classification depends on the distance of each respondent's proficiency from one of the cut points. For example, if the proficiency is 50 points away from the cut point, the accuracy is nearly perfect (.98) for all three scales. However, if the proficiency were just about on either side of a cut point, then the accuracy would be just about .50.
The statistics below indicate the average accuracy of classifying respondents into these levels for each of the scales. These estimates are based on the average accuracy calculated at various distances in each of the levels.

Literacy scale   Locator accuracy
Prose            .880
Document         .881
Quantitative     .882
Health           .931
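The relationship between distance from a cut point and classification accuracy can be sketched with a normal approximation to the measurement error. The SEM value below is an illustrative assumption chosen so that a 50-point distance reproduces roughly the .98 figure in the text; it is not a reported statistic:

```python
import math

def classification_accuracy(theta, cut, sem):
    """Probability that a respondent with true proficiency `theta` is
    classified on the correct side of `cut`, assuming normally
    distributed measurement error with standard error `sem`."""
    z = abs(theta - cut) / sem
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

SEM = 24.0  # assumed standard error of measurement on the 0-500 scale

# 50 points from the 225 cut point: accuracy close to .98, as in the text.
assert classification_accuracy(275.0, 225.0, SEM) > 0.97
# Right at the cut point: accuracy falls to .50.
assert abs(classification_accuracy(225.0, 225.0, SEM) - 0.5) < 1e-9
```

Averaging this quantity over the proficiency distribution within each level gives overall classification accuracies of the kind reported in the table above.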
Validity of scores

Validity refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores. The validity of the PDQ Profile Series and the Health Activities Literacy Test comes from the various efforts that were undertaken in the development and conduct of the large-scale national and international assessments. Collectively, these efforts contribute to aspects of construct and criterion-related validity. These activities include: a definition of literacy that was drafted by a national panel of experts and adopted by countries for use in the international assessments; the development of a framework that operationalized this definition, drawing on recent theories of reading and literacy; the development of literacy tasks that used everyday materials selected by panels of experts and reviewed both nationally and internationally; the administration of all items to large representative samples of adults covering a wide range of social and economic backgrounds, with checks for bias; the conduct of item attribute studies to determine the kinds of knowledge and skills that are associated with successful performance; and the linkage of each literacy scale to extensive background characteristics so that it is possible to examine the connection between these characteristics and performance on the various literacy scales. The definition and framework used to construct and understand the literacy scales form the body of information that supports the construct validity of these measures. See, for example, Kirsch, I., The International Adult Literacy Survey: Defining What Was Measured. Princeton, NJ: Educational Testing Service. RR. The strong connections that are observed between these various literacy scales and the set of background characteristics support the criterion-related validity of the measures. This type of information is printed in the reports and publications stemming from the large-scale assessments.
A selected set of tables representing these relationships is provided here. The idea is that if the PDQ and HALS tests are measuring aspects of literacy that are important to participation in various aspects of society, such as education and labor, then respondents' performance on the PDQ and HALS should relate to these and other variables that are collected as part of the survey. All four scales are shown here to have strong positive relationships with respondents' highest level of education.

Average Literacy Proficiency by Education Level
Education level          Prose  Document  Quantitative  Health
0-8 yrs
yrs
GED/High School
Some College
yr College degree
4yr College degree
Graduate study/degree

The relationship between respondents' age and proficiency is weaker and also curvilinear. The average proficiency is highest among those who are between 35 and 44 years of age. It should be noted that document proficiency seems to have the strongest relationship with age, as evidenced by the sharp decline of proficiency after 45.

Average Literacy Proficiency by Age
Age              Prose  Document  Quantitative  Health
16 to
to
to
to
to
and older

It was expected that country of birth would be strongly related to English literacy skills, since it is likely to reflect the language one first learns to speak and read. As seen below, the average difference between those who were born in the US and those who were born in another country is about 60 to 70 points.

Average Literacy Proficiency by Country of Birth
Country of Birth          Prose  Document  Quantitative  Health
Born in the USA
Born in another country

Examples of reading practices and literacy skills are also shown below. It is clear that those with higher literacy skills read newspapers far more often than others. There is a very strong indication that respondents who do not read newspapers have very low literacy skills.

Average Literacy Proficiency by Newspaper Reading Practices
Newspaper Reading Practices  Prose  Document  Quantitative  Health
Every day
A few times a week
Once a week
Less than once a week
Never

The table below shows how literacy proficiency relates to employment status. Those respondents who are employed full time show much higher literacy skills than those who are unemployed or those who are out of the labor force.

Average Proficiency by Labor Force Status
Labor Force Status   Prose  Document  Quantitative  Health
Full-time employed
Part-time employed
Unemployed
Out of labor force

In addition to employment status, the type of occupation has a strong relationship with literacy proficiency, as is shown in the table below. The respondents who are employed as professionals or managers have much higher literacy skills than those who work as laborers.

Average Proficiency by Most Recent Occupation
Most Recent Occupation    Prose  Document  Quantitative  Health
Professional/Managers
Sales
Craft
Laborer
JOURNAL OF EDUCATIONAL MEASUREMENT VOL. II, NO, 2 FALL 1974 THE NATURE OF OBJECTIVITY WITH THE RASCH MODEL SUSAN E. WHITELY' AND RENE V. DAWIS 2 University of Minnesota Although it has been claimed that
More informationReliability, validity, and all that jazz
Reliability, validity, and all that jazz Dylan Wiliam King s College London Introduction No measuring instrument is perfect. The most obvious problems relate to reliability. If we use a thermometer to
More informationChapter 2 Norms and Basic Statistics for Testing MULTIPLE CHOICE
Chapter 2 Norms and Basic Statistics for Testing MULTIPLE CHOICE 1. When you assert that it is improbable that the mean intelligence test score of a particular group is 100, you are using. a. descriptive
More informationJSM Survey Research Methods Section
Methods and Issues in Trimming Extreme Weights in Sample Surveys Frank Potter and Yuhong Zheng Mathematica Policy Research, P.O. Box 393, Princeton, NJ 08543 Abstract In survey sampling practice, unequal
More informationDescription of components in tailored testing
Behavior Research Methods & Instrumentation 1977. Vol. 9 (2).153-157 Description of components in tailored testing WAYNE M. PATIENCE University ofmissouri, Columbia, Missouri 65201 The major purpose of
More informationINVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form
INVESTIGATING FIT WITH THE RASCH MODEL Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form of multidimensionality. The settings in which measurement
More informationBruno D. Zumbo, Ph.D. University of Northern British Columbia
Bruno Zumbo 1 The Effect of DIF and Impact on Classical Test Statistics: Undetected DIF and Impact, and the Reliability and Interpretability of Scores from a Language Proficiency Test Bruno D. Zumbo, Ph.D.
More informationMeasuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University
Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure Rob Cavanagh Len Sparrow Curtin University R.Cavanagh@curtin.edu.au Abstract The study sought to measure mathematics anxiety
More informationAPPLYING THE RASCH MODEL TO PSYCHO-SOCIAL MEASUREMENT A PRACTICAL APPROACH
APPLYING THE RASCH MODEL TO PSYCHO-SOCIAL MEASUREMENT A PRACTICAL APPROACH Margaret Wu & Ray Adams Documents supplied on behalf of the authors by Educational Measurement Solutions TABLE OF CONTENT CHAPTER
More information11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES
Correlational Research Correlational Designs Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are
More informationChapter 2--Norms and Basic Statistics for Testing
Chapter 2--Norms and Basic Statistics for Testing Student: 1. Statistical procedures that summarize and describe a series of observations are called A. inferential statistics. B. descriptive statistics.
More informationRunning head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note
Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1 Nested Factor Analytic Model Comparison as a Means to Detect Aberrant Response Patterns John M. Clark III Pearson Author Note John M. Clark III,
More informationApplications. DSC 410/510 Multivariate Statistical Methods. Discriminating Two Groups. What is Discriminant Analysis
DSC 4/5 Multivariate Statistical Methods Applications DSC 4/5 Multivariate Statistical Methods Discriminant Analysis Identify the group to which an object or case (e.g. person, firm, product) belongs:
More informationBuilding Evaluation Scales for NLP using Item Response Theory
Building Evaluation Scales for NLP using Item Response Theory John Lalor CICS, UMass Amherst Joint work with Hao Wu (BC) and Hong Yu (UMMS) Motivation Evaluation metrics for NLP have been mostly unchanged
More informationSamantha Sample 01 Feb 2013 EXPERT STANDARD REPORT ABILITY ADAPT-G ADAPTIVE GENERAL REASONING TEST. Psychometrics Ltd.
01 Feb 2013 EXPERT STANDARD REPORT ADAPTIVE GENERAL REASONING TEST ABILITY ADAPT-G REPORT STRUCTURE The Standard Report presents s results in the following sections: 1. Guide to Using This Report Introduction
More informationSection 5. Field Test Analyses
Section 5. Field Test Analyses Following the receipt of the final scored file from Measurement Incorporated (MI), the field test analyses were completed. The analysis of the field test data can be broken
More informationLec 02: Estimation & Hypothesis Testing in Animal Ecology
Lec 02: Estimation & Hypothesis Testing in Animal Ecology Parameter Estimation from Samples Samples We typically observe systems incompletely, i.e., we sample according to a designed protocol. We then
More informationBasic concepts and principles of classical test theory
Basic concepts and principles of classical test theory Jan-Eric Gustafsson What is measurement? Assignment of numbers to aspects of individuals according to some rule. The aspect which is measured must
More informationExamining the Psychometric Properties of The McQuaig Occupational Test
Examining the Psychometric Properties of The McQuaig Occupational Test Prepared for: The McQuaig Institute of Executive Development Ltd., Toronto, Canada Prepared by: Henryk Krajewski, Ph.D., Senior Consultant,
More informationCHAPTER 3 RESEARCH METHODOLOGY
CHAPTER 3 RESEARCH METHODOLOGY 3.1 Introduction 3.1 Methodology 3.1.1 Research Design 3.1. Research Framework Design 3.1.3 Research Instrument 3.1.4 Validity of Questionnaire 3.1.5 Statistical Measurement
More informationChapter 3 Tools for Practical Theorizing: Theoretical Maps and Ecosystem Maps
Chapter 3 Tools for Practical Theorizing: Theoretical Maps and Ecosystem Maps Chapter Outline I. Introduction A. Understanding theoretical languages requires universal translators 1. Theoretical maps identify
More informationCARE Cross-project Collectives Analysis: Technical Appendix
CARE Cross-project Collectives Analysis: Technical Appendix The CARE approach to development support and women s economic empowerment is often based on the building blocks of collectives. Some of these
More informationUvA-DARE (Digital Academic Repository)
UvA-DARE (Digital Academic Repository) Standaarden voor kerndoelen basisonderwijs : de ontwikkeling van standaarden voor kerndoelen basisonderwijs op basis van resultaten uit peilingsonderzoek van der
More informationBehavioral Intervention Rating Rubric. Group Design
Behavioral Intervention Rating Rubric Group Design Participants Do the students in the study exhibit intensive social, emotional, or behavioral challenges? % of participants currently exhibiting intensive
More informationUnit 1 Exploring and Understanding Data
Unit 1 Exploring and Understanding Data Area Principle Bar Chart Boxplot Conditional Distribution Dotplot Empirical Rule Five Number Summary Frequency Distribution Frequency Polygon Histogram Interquartile
More informationDesign and Analysis Plan Quantitative Synthesis of Federally-Funded Teen Pregnancy Prevention Programs HHS Contract #HHSP I 5/2/2016
Design and Analysis Plan Quantitative Synthesis of Federally-Funded Teen Pregnancy Prevention Programs HHS Contract #HHSP233201500069I 5/2/2016 Overview The goal of the meta-analysis is to assess the effects
More informationTest Validity. What is validity? Types of validity IOP 301-T. Content validity. Content-description Criterion-description Construct-identification
What is? IOP 301-T Test Validity It is the accuracy of the measure in reflecting the concept it is supposed to measure. In simple English, the of a test concerns what the test measures and how well it
More informationInvestigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories
Kamla-Raj 010 Int J Edu Sci, (): 107-113 (010) Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories O.O. Adedoyin Department of Educational Foundations,
More informationComparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria
Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Thakur Karkee Measurement Incorporated Dong-In Kim CTB/McGraw-Hill Kevin Fatica CTB/McGraw-Hill
More informationGMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups
GMAC Scaling Item Difficulty Estimates from Nonequivalent Groups Fanmin Guo, Lawrence Rudner, and Eileen Talento-Miller GMAC Research Reports RR-09-03 April 3, 2009 Abstract By placing item statistics
More informationInvestigating the Reliability of Classroom Observation Protocols: The Case of PLATO. M. Ken Cor Stanford University School of Education.
The Reliability of PLATO Running Head: THE RELIABILTY OF PLATO Investigating the Reliability of Classroom Observation Protocols: The Case of PLATO M. Ken Cor Stanford University School of Education April,
More informationMBA SEMESTER III. MB0050 Research Methodology- 4 Credits. (Book ID: B1206 ) Assignment Set- 1 (60 Marks)
MBA SEMESTER III MB0050 Research Methodology- 4 Credits (Book ID: B1206 ) Assignment Set- 1 (60 Marks) Note: Each question carries 10 Marks. Answer all the questions Q1. a. Differentiate between nominal,
More informationMarc J. Tassé, PhD Nisonger Center UCEDD
FINALLY... AN ADAPTIVE BEHAVIOR SCALE FOCUSED ON PROVIDING PRECISION AT THE DIAGNOSTIC CUT-OFF. How Item Response Theory Contributed to the Development of the DABS Marc J. Tassé, PhD UCEDD The Ohio State
More informationBIOSTATISTICAL METHODS AND RESEARCH DESIGNS. Xihong Lin Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
BIOSTATISTICAL METHODS AND RESEARCH DESIGNS Xihong Lin Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA Keywords: Case-control study, Cohort study, Cross-Sectional Study, Generalized
More informationLikelihood Ratio Based Computerized Classification Testing. Nathan A. Thompson. Assessment Systems Corporation & University of Cincinnati.
Likelihood Ratio Based Computerized Classification Testing Nathan A. Thompson Assessment Systems Corporation & University of Cincinnati Shungwon Ro Kenexa Abstract An efficient method for making decisions
More informationThe Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing
The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing Terry A. Ackerman University of Illinois This study investigated the effect of using multidimensional items in
More informationalternate-form reliability The degree to which two or more versions of the same test correlate with one another. In clinical studies in which a given function is going to be tested more than once over
More informationTest review. Comprehensive Trail Making Test (CTMT) By Cecil R. Reynolds. Austin, Texas: PRO-ED, Inc., Test description
Archives of Clinical Neuropsychology 19 (2004) 703 708 Test review Comprehensive Trail Making Test (CTMT) By Cecil R. Reynolds. Austin, Texas: PRO-ED, Inc., 2002 1. Test description The Trail Making Test
More informationHierarchical Bayesian Modeling of Individual Differences in Texture Discrimination
Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination Timothy N. Rubin (trubin@uci.edu) Michael D. Lee (mdlee@uci.edu) Charles F. Chubb (cchubb@uci.edu) Department of Cognitive
More informationGender-Based Differential Item Performance in English Usage Items
A C T Research Report Series 89-6 Gender-Based Differential Item Performance in English Usage Items Catherine J. Welch Allen E. Doolittle August 1989 For additional copies write: ACT Research Report Series
More informationDetection Theory: Sensitivity and Response Bias
Detection Theory: Sensitivity and Response Bias Lewis O. Harvey, Jr. Department of Psychology University of Colorado Boulder, Colorado The Brain (Observable) Stimulus System (Observable) Response System
More informationAssessing the Validity and Reliability of the Teacher Keys Effectiveness. System (TKES) and the Leader Keys Effectiveness System (LKES)
Assessing the Validity and Reliability of the Teacher Keys Effectiveness System (TKES) and the Leader Keys Effectiveness System (LKES) of the Georgia Department of Education Submitted by The Georgia Center
More informationRAG Rating Indicator Values
Technical Guide RAG Rating Indicator Values Introduction This document sets out Public Health England s standard approach to the use of RAG ratings for indicator values in relation to comparator or benchmark
More informationAnalysis of Confidence Rating Pilot Data: Executive Summary for the UKCAT Board
Analysis of Confidence Rating Pilot Data: Executive Summary for the UKCAT Board Paul Tiffin & Lewis Paton University of York Background Self-confidence may be the best non-cognitive predictor of future
More informationImportance of Good Measurement
Importance of Good Measurement Technical Adequacy of Assessments: Validity and Reliability Dr. K. A. Korb University of Jos The conclusions in a study are only as good as the data that is collected. The
More informationEvaluation Models STUDIES OF DIAGNOSTIC EFFICIENCY
2. Evaluation Model 2 Evaluation Models To understand the strengths and weaknesses of evaluation, one must keep in mind its fundamental purpose: to inform those who make decisions. The inferences drawn
More informationChapter-2 RESEARCH DESIGN
Chapter-2 RESEARCH DESIGN 33 2.1 Introduction to Research Methodology: The general meaning of research is the search for knowledge. Research is also defined as a careful investigation or inquiry, especially
More informationMantel-Haenszel Procedures for Detecting Differential Item Functioning
A Comparison of Logistic Regression and Mantel-Haenszel Procedures for Detecting Differential Item Functioning H. Jane Rogers, Teachers College, Columbia University Hariharan Swaminathan, University of
More informationShiken: JALT Testing & Evaluation SIG Newsletter. 12 (2). April 2008 (p )
Rasch Measurementt iin Language Educattiion Partt 2:: Measurementt Scalles and Invariiance by James Sick, Ed.D. (J. F. Oberlin University, Tokyo) Part 1 of this series presented an overview of Rasch measurement
More informationThe Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests
The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests Mary E. Lunz and Betty A. Bergstrom, American Society of Clinical Pathologists Benjamin D. Wright, University
More informationWrite your identification number on each paper and cover sheet (the number stated in the upper right hand corner on your exam cover).
STOCKHOLM UNIVERSITY Department of Economics Course name: Empirical methods 2 Course code: EC2402 Examiner: Per Pettersson-Lidbom Number of credits: 7,5 credits Date of exam: Sunday 21 February 2010 Examination
More informationMeasurement Invariance (MI): a general overview
Measurement Invariance (MI): a general overview Eric Duku Offord Centre for Child Studies 21 January 2015 Plan Background What is Measurement Invariance Methodology to test MI Challenges with post-hoc
More informationCritical Thinking Assessment at MCC. How are we doing?
Critical Thinking Assessment at MCC How are we doing? Prepared by Maura McCool, M.S. Office of Research, Evaluation and Assessment Metropolitan Community Colleges Fall 2003 1 General Education Assessment
More informationA Brief Introduction to Bayesian Statistics
A Brief Introduction to Statistics David Kaplan Department of Educational Psychology Methods for Social Policy Research and, Washington, DC 2017 1 / 37 The Reverend Thomas Bayes, 1701 1761 2 / 37 Pierre-Simon
More informationAnalysis and Interpretation of Data Part 1
Analysis and Interpretation of Data Part 1 DATA ANALYSIS: PRELIMINARY STEPS 1. Editing Field Edit Completeness Legibility Comprehensibility Consistency Uniformity Central Office Edit 2. Coding Specifying
More informationDevelopment, Standardization and Application of
American Journal of Educational Research, 2018, Vol. 6, No. 3, 238-257 Available online at http://pubs.sciepub.com/education/6/3/11 Science and Education Publishing DOI:10.12691/education-6-3-11 Development,
More informationPUBLIC KNOWLEDGE AND ATTITUDES SCALE CONSTRUCTION: DEVELOPMENT OF SHORT FORMS
PUBLIC KNOWLEDGE AND ATTITUDES SCALE CONSTRUCTION: DEVELOPMENT OF SHORT FORMS Prepared for: Robert K. Bell, Ph.D. National Science Foundation Division of Science Resources Studies 4201 Wilson Blvd. Arlington,
More informationEvaluating the quality of analytic ratings with Mokken scaling
Psychological Test and Assessment Modeling, Volume 57, 2015 (3), 423-444 Evaluating the quality of analytic ratings with Mokken scaling Stefanie A. Wind 1 Abstract Greatly influenced by the work of Rasch
More informationo^ &&cvi AL Perceptual and Motor Skills, 1965, 20, Southern Universities Press 1965
Ml 3 Hi o^ &&cvi AL 44755 Perceptual and Motor Skills, 1965, 20, 311-316. Southern Universities Press 1965 m CONFIDENCE RATINGS AND LEVEL OF PERFORMANCE ON A JUDGMENTAL TASK 1 RAYMOND S. NICKERSON AND
More informationSTATISTICS AND RESEARCH DESIGN
Statistics 1 STATISTICS AND RESEARCH DESIGN These are subjects that are frequently confused. Both subjects often evoke student anxiety and avoidance. To further complicate matters, both areas appear have
More informationSchool Annual Education Report (AER) Cover Letter
Lincoln Elementary Sam Skeels, Principal 158 S. Scott St Adrian, MI 49221 Phone: 517-265-8544 School (AER) Cover Letter April 29, 2017 Dear Parents and Community Members: We are pleased to present you
More informationConnexion of Item Response Theory to Decision Making in Chess. Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan
Connexion of Item Response Theory to Decision Making in Chess Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan Acknowledgement A few Slides have been taken from the following presentation
More informationCochrane Pregnancy and Childbirth Group Methodological Guidelines
Cochrane Pregnancy and Childbirth Group Methodological Guidelines [Prepared by Simon Gates: July 2009, updated July 2012] These guidelines are intended to aid quality and consistency across the reviews
More informationChapter 11. Experimental Design: One-Way Independent Samples Design
11-1 Chapter 11. Experimental Design: One-Way Independent Samples Design Advantages and Limitations Comparing Two Groups Comparing t Test to ANOVA Independent Samples t Test Independent Samples ANOVA Comparing
More informationCross-validation of easycbm Reading Cut Scores in Washington:
Technical Report # 1109 Cross-validation of easycbm Reading Cut Scores in Washington: 2009-2010 P. Shawn Irvin Bitnara Jasmine Park Daniel Anderson Julie Alonzo Gerald Tindal University of Oregon Published
More informationTHE USE OF MULTIVARIATE ANALYSIS IN DEVELOPMENT THEORY: A CRITIQUE OF THE APPROACH ADOPTED BY ADELMAN AND MORRIS A. C. RAYNER
THE USE OF MULTIVARIATE ANALYSIS IN DEVELOPMENT THEORY: A CRITIQUE OF THE APPROACH ADOPTED BY ADELMAN AND MORRIS A. C. RAYNER Introduction, 639. Factor analysis, 639. Discriminant analysis, 644. INTRODUCTION
More informationValidating Measures of Self Control via Rasch Measurement. Jonathan Hasford Department of Marketing, University of Kentucky
Validating Measures of Self Control via Rasch Measurement Jonathan Hasford Department of Marketing, University of Kentucky Kelly D. Bradley Department of Educational Policy Studies & Evaluation, University
More informationStatistics. Nur Hidayanto PSP English Education Dept. SStatistics/Nur Hidayanto PSP/PBI
Statistics Nur Hidayanto PSP English Education Dept. RESEARCH STATISTICS WHAT S THE RELATIONSHIP? RESEARCH RESEARCH positivistic Prepositivistic Postpositivistic Data Initial Observation (research Question)
More informationMultilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison
Group-Level Diagnosis 1 N.B. Please do not cite or distribute. Multilevel IRT for group-level diagnosis Chanho Park Daniel M. Bolt University of Wisconsin-Madison Paper presented at the annual meeting
More information1 Introduction. st0020. The Stata Journal (2002) 2, Number 3, pp
The Stata Journal (22) 2, Number 3, pp. 28 289 Comparative assessment of three common algorithms for estimating the variance of the area under the nonparametric receiver operating characteristic curve
More informationlinking in educational measurement: Taking differential motivation into account 1
Selecting a data collection design for linking in educational measurement: Taking differential motivation into account 1 Abstract In educational measurement, multiple test forms are often constructed to
More informationEmpirical Formula for Creating Error Bars for the Method of Paired Comparison
Empirical Formula for Creating Error Bars for the Method of Paired Comparison Ethan D. Montag Rochester Institute of Technology Munsell Color Science Laboratory Chester F. Carlson Center for Imaging Science
More informationPERSONALITY ASSESSMENT PROFICIENCY: REPORT REVIEW FORM
PERSONALITY ASSESSMENT PROFICIENCY: REPORT REVIEW FORM Applicant Name: Reviewer Name: Date: I. Please consider each criteria item as either: Met proficiency criterion (Yes, circle 1 point) or Not met proficiency
More informationA Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests
A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests David Shin Pearson Educational Measurement May 007 rr0701 Using assessment and research to promote learning Pearson Educational
More informationTitle: A new statistical test for trends: establishing the properties of a test for repeated binomial observations on a set of items
Title: A new statistical test for trends: establishing the properties of a test for repeated binomial observations on a set of items Introduction Many studies of therapies with single subjects involve
More informationDRAFT (Final) Concept Paper On choosing appropriate estimands and defining sensitivity analyses in confirmatory clinical trials
DRAFT (Final) Concept Paper On choosing appropriate estimands and defining sensitivity analyses in confirmatory clinical trials EFSPI Comments Page General Priority (H/M/L) Comment The concept to develop
More informationCHAPTER - 6 STATISTICAL ANALYSIS. This chapter discusses inferential statistics, which use sample data to
CHAPTER - 6 STATISTICAL ANALYSIS 6.1 Introduction This chapter discusses inferential statistics, which use sample data to make decisions or inferences about population. Populations are group of interest
More informationSUPPLEMENTARY INFORMATION. Table 1 Patient characteristics Preoperative. language testing
Categorical Speech Representation in the Human Superior Temporal Gyrus Edward F. Chang, Jochem W. Rieger, Keith D. Johnson, Mitchel S. Berger, Nicholas M. Barbaro, Robert T. Knight SUPPLEMENTARY INFORMATION
More informationMulti-Specialty Recruitment Assessment Test Blueprint & Information
Multi-Specialty Recruitment Assessment Test Blueprint & Information 1. Structure of the Multi-Specialty Recruitment Assessment... 2 2. Professional Dilemmas Paper... 3 2.2. Context/Setting... 3 2.3. Target
More information