Technical Specifications


In order to provide summary information across a set of exercises, all tests must employ some form of scoring model. The most familiar of these scoring models is the one typically used in classroom testing: summing the number of correct responses to exercises in order to yield a total score. In some cases, these total scores are converted to percentage correct and assigned a letter grade. In other cases, a single cut point is selected to identify learners who have met some pre-selected criterion (e.g., 80 percent). This traditional scoring model assumes that each correct response is equally important and, therefore, is given equal weight in arriving at the total score. This assumption results in an additive model, in which the total score emphasizes the relative amount of knowledge and skill demonstrated, with no attention paid to the pattern of correct and incorrect responses.

The additive model extends from less formal classroom tests to more formal standardized tests. Standardized tests frequently employ this additive model through the use of stanine and grade-equivalent scores. These standardized scores are determined through the use of particular norming populations and represent transformations of a set of raw scores.

Recent advances in computer technology and the development of efficient algorithms have made the application of more powerful psychometric theories accessible to a wider range of users. Item Response Theory (IRT) is the model used in building and managing both the ETS PDQ Profile Series and the ETS Health Activities Literacy tests. Unlike traditional additive models, the IRT model allows us to estimate the difficulty of a particular exercise relative to the difficulty of all other exercises in the assessments, as well as to estimate an individual's proficiency level in the assessed domain. The IRT model is also able to take the pattern of right and wrong responses into account.

More specifically, item response theory (IRT) is a mathematical model for the probability that a particular person will respond correctly to a particular item from a domain of items. The particular IRT model employed in these tests is the two-parameter logistic model. This probability is given as a function of a parameter characterizing the proficiency of that person and two parameters characterizing the properties of that item: difficulty and discrimination. One of the strengths of IRT models is that when their assumptions hold and estimates of the model's item parameters are available for the collections of items that make up the different test forms, all results can be reported directly in terms of the IRT proficiency. This property of IRT scaling removes the need to establish the comparability of number-correct score scales for different forms of the test.

A pool of tasks over which performance is modeled, together with the accompanying proficiency variable, is referred to as a "scale." Analyses within a scale are generally carried out in two steps: first, the parameters of the tasks are estimated; second, the proficiency levels of individuals or groups are estimated using the item parameters estimated earlier. A unidimensional IRT model such as the two-parameter logistic model assumes that performance on all of the tasks in a particular domain can, for the most part, be accounted for by a single underlying proficiency variable.
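To make the two-parameter logistic model concrete before its formal presentation later in this document, here is a minimal Python sketch of the response-probability function; the item and person values shown are hypothetical, not taken from the actual item pools.

```python
import math

def p_correct(theta: float, a: float, b: float, D: float = 1.7) -> float:
    """Two-parameter logistic (2PL) probability of a correct response.

    theta: person proficiency
    a:     item discrimination (slope)
    b:     item difficulty (location)
    D:     scaling constant (1.7 approximates the normal ogive)
    """
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

# Hypothetical item: moderate discrimination, difficulty at the scale midpoint.
for theta in (-1.0, 0.0, 1.0):
    print(theta, round(p_correct(theta, a=1.0, b=0.0), 3))
```

The probability rises monotonically with proficiency and falls as item difficulty increases, which is exactly the behavior that gives each task a position on the scale.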
Since tasks are placed along an IRT scale on the basis of their characteristics, it is possible to address questions concerning the nature of these tasks and their position on the scale relative to one another. For example, questions such as the following come to mind: Why do tasks fall at different levels along a given scale? Do tasks that cluster around a given scale point reflect similar interactions between materials and processing demands? Can tasks at different levels be distinguished in terms of these variables? Answers to such questions lead to a better understanding of the knowledge and skills underlying successful performance at various proficiency levels within the assessed domain.

In addition to focusing attention on the distribution of tasks along a scale, the IRT model used in these tests also estimates the proficiency levels of individuals based on their patterns of right and wrong responses to the set of exercises that make up the scale. This is unlike the traditional additive scoring model, which simply ranks individuals by their total score, independently of which items they answered correctly or incorrectly.

Establishing the PDQ literacy scales: linkage to the other surveys

Prose, document, and quantitative literacy proficiencies are reported on scales that have been established across several national and international adult literacy surveys. Together, these surveys represent more than 800 items and over 145,000 respondents. The surveys include the Young Adult Literacy Survey, the Department of Labor Survey of Workplace Literacy, the National Adult Literacy Survey, the International Adult Literacy Survey, and the International Adult Literacy and Lifeskills Survey.

These survey data are all matrix sampled; i.e., each respondent received only a subset of items out of a larger item pool, according to a particular administration design. This sampling design yields response data on large numbers of items while minimizing the response burden on each individual respondent. A psychometric measurement model is used to seam all responses together into a single scale representation. The use of a psychometric model is critical, since none of the surveys included all 800 items. While only a subset of items was used in any one survey, each survey contained a substantial number of items identical to items in at least one other survey, providing the overlap necessary to link the items. This overlap of items across multiple surveys enables us to establish and maintain common literacy scales across the various surveys. The items making up each of the literacy scales provide a pool from which to select subsets of items with appropriate characteristics for measuring an individual's literacy proficiency.
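The matrix-sampling and common-item linking design described above can be shown in miniature. In the hedged Python sketch below, each booklet carries only a slice of a larger pool, but consecutive booklets share common items; those shared items are what allow responses from different booklets to be placed on one scale. The pool size and booklet layout are invented for illustration.

```python
# Hypothetical matrix-sampled design: no respondent sees the whole pool,
# but overlapping blocks of common items link every booklet together.
ITEM_POOL = [f"item{i:03d}" for i in range(1, 13)]  # stand-in for an 800-item pool

# Four-item booklets; consecutive booklets share a two-item overlap.
BOOKLETS = {
    "A": ITEM_POOL[0:4],
    "B": ITEM_POOL[2:6],   # items 3-4 overlap with booklet A
    "C": ITEM_POOL[4:8],   # items 5-6 overlap with booklet B
    "D": ITEM_POOL[6:10],  # and so on
}

for name, items in BOOKLETS.items():
    print(name, items)

# The common items between adjacent booklets provide the linkage:
print(sorted(set(BOOKLETS["A"]) & set(BOOKLETS["B"])))  # ['item003', 'item004']
```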

Establishing the Health Activities Literacy Scale (HALS)

The Health Activities Literacy Scale is organized around five general categories of health-related activities:

Health promotion: enhancing and maintaining one's own health, including activities related to nutrition and exercise;
Health protection: safeguarding the health of individuals and communities, including health-related social and environmental issues;
Disease prevention: taking preventive measures (e.g., immunizations) and engaging in early detection, such as screening programs;
Health care and maintenance: seeking care and forming a partnership with health providers;
Systems navigation: accessing needed services and understanding one's rights.

All of the items used in previous literacy assessments were reviewed by three researchers to select those that were judged to measure health-related activities. Those researchers then independently coded the 191 selected tasks into one of the five health-related activities. All differences were resolved through discussion and refinement of the coding criteria. The distribution of these tasks by type of health activity is shown in the following table.

Health Activities and Number of Coded Items (n = 191)
Health Promotion: 60
Health Protection: 65
Disease Prevention: 18
Health Care and Maintenance: 16
Systems Navigation: 32

The 191 health-related literacy tasks that were identified provided a link across the existing literacy surveys and were used to create a new Health Activities Literacy Scale (HALS). Using IRT methodology, new item parameters were estimated for each of these tasks based on the responses of a nationally representative sample.

The surveys from which the 191 health-related literacy tasks were selected represent different populations having various demographic characteristics. Current methodologies provide researchers with the tools needed to evaluate the performance of people even when they have been administered somewhat different tasks and when they represent different samples and populations studied over time. These methodologies have been used with student surveys such as the National Assessment of Educational Progress (NAEP) and the Programme for International Student Assessment (PISA), as well as with the adult literacy surveys mentioned previously. Therefore, even though the populations studied varied somewhat across the different surveys, the subsets of literacy tasks and the scoring rubrics that were common across the surveys were kept constant, and their item parameters were checked for stability across each of the surveys. Over the years, the same item parameters have been found to fit very well for each of the subpopulations within a country as well as across countries with different languages. Once the health-related literacy tasks had been scaled, the stability of the new item parameters was verified across each of the surveys to ensure that they fit well.

More than 58,000 respondents from across the various adult surveys were used to estimate and verify the item parameters for the health-related tasks. Because the focus of the current study is the U.S. population, only data from the U.S. were used. The model used for scaling the health literacy items from the NALS data is the two-parameter logistic (2PL) model from item response theory.

The stability of the item parameters was checked across the various survey populations to ensure the comparability of the data and the stability of the newly established scale. The common item parameters were reviewed to ensure that they fit well, in order to justify the use of the new item parameters and to establish the stability of the new HALS. Five different approaches were used to evaluate the stability of the item parameters:

A graphical method, which allows us to observe the item characteristic curves for various populations;
Three statistical indices, which estimate the fit of each item for each population against the common item parameters (the χ² statistic, the root mean squared deviation statistic, and the weighted mean deviation); and
The impact of the item parameters on the overall proficiency estimate of a particular population.

Deviations are based on the difference between model-based expected proportions correct and observed proportions correct at 41 equally spaced ability scale values. The fit of the health-related literacy tasks was remarkably good by any conventional standard, and, therefore, a single set of common item parameters could be used to describe all survey samples.

HALS is a new scale. Even though it is based on pre-existing items from existing literacy surveys, the properties of this new scale had not been previously defined. That is, the scale could range from 0 to 100, from 200 to 800, or within some other preselected range. The procedure used to align the health activities literacy scale with the NALS scales was based on matching two moments of the proficiency distributions: the mean and the standard deviation. In this study, the provisional proficiency distribution based on the health scale was matched to the distribution of the means of the three NALS scale proficiency values (m = and sd = 65.380). This allowed us to apply a linear transformation that defines the HALS on a scale ranging from 0 to 500, having the same mean and standard deviation as the three NALS proficiency scales.

One of the benefits of the HALS lies in the fact that it uses items from existing large-scale surveys of adults. Several researchers reviewed each literacy task to determine how well it fit into the five health activities described in this report. This adds content relevance to the scale, because each item was judged to be representative of a type of health activity, thus focusing the measurement on tasks that broadly define health literacy rather than general literacy. Each of the 191 items that make up the HALS had been administered to nationally representative samples of adults. Because a large number of adults responded to each item, we were able to check how well each item behaves psychometrically. For example, each item was checked for differential performance by selected subgroups. In addition, each item was checked to determine how well it fits onto the overall scale. Other pieces of information relating to the validity of the HALS stem from our understanding of the construct and of what contributes to the difficulty of each item and its position along the health scale.
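To make the item-fit evaluation described above concrete: the indices compare model-based expected proportions correct with observed proportions correct at 41 equally spaced ability values. The following Python sketch computes an unweighted root mean squared deviation of that general kind; the item parameters and "observed" proportions are fabricated solely to show the computation, and the operational statistic additionally weights each point by the population density there.

```python
import math

D = 1.7  # scaling constant used in the 2PL model

def expected_p(theta: float, a: float, b: float) -> float:
    """Model-based expected proportion correct under the 2PL."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

# 41 equally spaced ability values (illustrative range).
thetas = [-4.0 + 0.2 * k for k in range(41)]

a, b = 1.1, 0.3  # hypothetical common item parameters
expected = [expected_p(t, a, b) for t in thetas]

# Stand-in for the observed conditional proportions correct in one survey
# sample: here, simply the model values plus a small fixed perturbation.
observed = [min(1.0, p + 0.02) for p in expected]

rmsd = math.sqrt(sum((o - e) ** 2 for o, e in zip(observed, expected)) / len(thetas))
print(f"RMSD = {rmsd:.4f}")  # small values indicate good item fit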
The NALS database links the HALS to an extensive set of background information. This link also contributes to the validation of the HALS.

Using this information, we are able to see the correlations between the HALS and a wide range of background characteristics, including age, gender, race/ethnicity, and level of education.

Design of testlets and item selection

Both the PDQ Profile Series and the Health Activities Literacy Test include locator tests that employ adaptive testing. Standard adaptive testing is carried out at the item level. The general idea of adaptive testing is to administer the item that provides the most information with regard to the respondent's proficiency; the measurement model determines the criteria used. While there are many benefits to using an adaptive test design, there are some issues that need to be considered. For example, item-level adaptive testing is not necessarily ideal under all circumstances, due to its higher overhead costs. The most serious concern regarding the use of item-level adaptation for a test like the PDQ Profile Series or the Health Activities Literacy Test is that multiple questions are based on a single stimulus material. Thus, item-level adaptation is not very feasible for an individual. However, it is possible to have what might be called a quasi-adaptive test made up of testlets. A testlet is formed from several stimulus/item sets. Selecting and administering a testlet to those within its optimal range of proficiency can achieve most of the efficiency gain one might expect from a fully adaptive test while addressing the other limiting factors.

The locator adaptive test for both the PDQ and HALS consists of four phases: 1) a small set of background information, 2) Stage 1 cognitive items, 3) Stage 2 cognitive items, and 4) reporting of results. The background information collected in phase 1 provides an initial estimate of where someone's proficiency may lie before the administration of any cognitive items. This initial estimate is based on the relationships between background information and literacy proficiency seen in the large-scale assessment data. Incorporating this background information into the overall design makes the administration of cognitive items (Stage 1) more efficient. The cognitive items administered in Stage 2 are selected based on the results from both the background information and the Stage 1 testing. However, the results are calculated and reported based only on the responses to the cognitive items administered in Stages 1 and 2; background information is not used in the computation of the reported results.

A primary difference between the locator test and the full-length test is the level of precision we obtain about where a person is on a particular literacy scale. For the locator test, the goal is simply to determine whether each person is in Level 1, Level 2, or Level 3 and higher. With the full-length test, our goal is to estimate proficiency on each literacy scale. This requires more cognitive test information; therefore, we add another stage of testing. As in the locator test, each individual is given a set of background questions. This is followed first by a set of cognitive items (Stage 1) and then by a second set of cognitive items (Stage 2) selected based on the responses to the first set. The full-length test has an additional set of cognitive items (Stage 3). This testlet is longer than the two earlier testlets, since the purpose of the full-length test is to estimate the proficiency of each respondent as accurately as possible.
Results are then calculated and reported based only on the cognitive items taken in Stages 1 to 3.
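A hedged Python sketch of the quasi-adaptive, testlet-based routing described above: a provisional estimate from the background information selects the Stage 1 testlet, and an updated estimate selects the Stage 2 testlet. The testlet names, proficiency ranges, and estimates are hypothetical; the operational tests use model-based selection criteria.

```python
# Hypothetical multistage routing on a 0-500 reporting scale.
STAGE1_TESTLETS = {"S1-low": (0.0, 225.0), "S1-mid": (225.0, 275.0), "S1-high": (275.0, 500.0)}
STAGE2_TESTLETS = {"S2-low": (0.0, 225.0), "S2-mid": (225.0, 275.0), "S2-high": (275.0, 500.0)}

def route(theta_estimate: float, testlets: dict) -> str:
    """Select the testlet whose target proficiency range covers the estimate."""
    for name, (low, high) in testlets.items():
        if low <= theta_estimate < high:
            return name
    raise ValueError("estimate outside the 0-500 reporting scale")

theta = 240.0                           # provisional estimate from background data
stage1 = route(theta, STAGE1_TESTLETS)  # Phase 2: administer this testlet
theta = 262.0                           # stands in for an IRT re-estimate after Stage 1
stage2 = route(theta, STAGE2_TESTLETS)  # Phase 3: administer this testlet
print(stage1, stage2)                   # S1-mid S2-mid
```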

Test Design

Test         Background  Stage 1  Stage 2  Stage 3  Reporting
Locator      X           X        X                 X
Full-Length  X           X        X        X        X

The scaling model

The scaling model used for the PDQ Profile Series and the Health Activities Literacy tests is the two-parameter logistic (2PL) model from Item Response Theory (Birnbaum, 1968; Lord, 1980). It is a mathematical model for the probability that a particular person will respond correctly to a particular item from a single domain of items. This probability is given as a function of a parameter characterizing the proficiency of that person and two parameters characterizing the properties of that item. The following 2PL IRT model was employed in the IALS:

$$P_i(\theta_j) = P(x_{ij} = 1 \mid \theta_j, a_i, b_i) = \frac{1}{1 + \exp\bigl(-D a_i (\theta_j - b_i)\bigr)}$$

where

x_ij is the response of person j to item i, 1 if correct and 0 if incorrect;
θ_j is the proficiency of person j (note that a person with higher proficiency has a greater probability of responding correctly);
D is a normalizing constant, set to 1.7;
a_i is the slope parameter of item i, characterizing its sensitivity to proficiency;
b_i is its location parameter, characterizing its difficulty.

Note that this is a monotone increasing function with respect to θ; that is, the conditional probability of a correct response increases as the value of θ increases. In addition, a linear indeterminacy exists with respect to the values of θ_j, a_i, and b_i for a scale defined under the two-parameter model. In other words, for an arbitrary linear transformation of θ, say θ* = Mθ + X, the corresponding transformations a*_i = a_i / M and b*_i = M b_i + X give:

$$P(x_{ij} = 1 \mid \theta_j^{*}, a_i^{*}, b_i^{*}) = P(x_{ij} = 1 \mid \theta_j, a_i, b_i)$$

The transformation constants for the PDQ provisional scale were set in 1994 as (51.67, ) for Prose, (52.46, ) for Document, and (54.41, ) for Quantitative.
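The linear indeterminacy above is easy to verify numerically: transforming θ, a, and b as shown leaves the response probability unchanged. A small Python check, with arbitrary person, item, and transformation values chosen purely for illustration:

```python
import math

D = 1.7

def p2pl(theta: float, a: float, b: float) -> float:
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

theta, a, b = 0.8, 1.2, -0.4   # arbitrary person and item parameters
M, X = 50.0, 250.0             # arbitrary linear transformation constants

theta_s = M * theta + X        # theta* = M * theta + X
a_s = a / M                    # a*     = a / M
b_s = M * b + X                # b*     = M * b + X

# The two probabilities are identical: a*(theta* - b*) reduces to a*(theta - b).
print(p2pl(theta, a, b))
print(p2pl(theta_s, a_s, b_s))
```

This is the same mechanism used to place the provisional scales onto their 0 to 500 reporting metric by matching a chosen mean and standard deviation.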

These constants were set so that the total proficiency distributions of the Young Adult Literacy Survey have the mean of and standard deviation of . All subsequent surveys used these transformations.

Another main assumption of IRT is conditional independence. In other words, item response probabilities depend only on θ (a measure of proficiency) and the specified item parameters, and not on any demographic characteristics of the respondent, on any other items presented together in a test, or on the survey administration conditions. This extends to multiple languages and multiple assessments over time, and it enables us to place all items on one scale even though these items appeared in multiple surveys administered to multiple populations over time. The assumption was monitored for every survey sample using χ² statistics, the square root of the weighted mean squared deviation, and the weighted mean deviation. Conditional independence enables us to formulate the following joint probability of a particular response pattern x across a set of n items:

$$P(x \mid \theta, a, b) = \prod_{i=1}^{n} P_i(\theta)^{x_i} \bigl(1 - P_i(\theta)\bigr)^{1 - x_i}$$

Replacing the hypothetical response pattern with the real scored data, the function above can be viewed as a likelihood function to be maximized for a given set of item parameters. These item parameters were treated as known in the subsequent analyses.

Another assumption of the model is unidimensionality; that is, performance on a set of items is accounted for by a single unidimensional variable. Although this assumption may be too strong, the use of the model is motivated by the need to summarize overall performance parsimoniously within a single domain. Hence, item parameters were estimated for each scale separately.

Testing the assumptions of the IRT model, especially the assumption of conditional independence, is a critical part of the data analyses. Conditional independence means that respondents with identical abilities have a similar probability of producing a correct response on an item regardless of their group membership. Serious violation of the conditional independence assumption would undermine the accuracy and integrity of the results. It is common practice to expect a portion of items to be found unsuitable for a particular subpopulation. Thus, while the item parameters were being estimated, empirical conditional percentages correct were monitored across the samples.

Estimation of Proficiency

As described earlier, the PDQ scales were based on data from large-scale surveys. The purpose of a large-scale survey is to focus on the proficiency distributions of subpopulations rather than on the proficiencies of individuals. However, the primary interest of the PDQ is to estimate the proficiency of individual respondents. The proficiency estimation method chosen for the PDQ is the expectation of the likelihood function, which is approximated by numerical integration of the following function:

$$f(\theta \mid x_j, a, b)$$
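A hedged sketch of this expected a posteriori (EAP) computation by numerical quadrature follows. The item parameters, the standard-normal prior, the grid, and the response pattern are all invented for illustration; the operational procedure uses the estimated parameters of the items actually administered.

```python
import math

D = 1.7

def p2pl(theta: float, a: float, b: float) -> float:
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def likelihood(theta: float, responses: list, items: list) -> float:
    """Joint probability of a scored 0/1 response pattern given theta (2PL)."""
    L = 1.0
    for x, (a, b) in zip(responses, items):
        p = p2pl(theta, a, b)
        L *= p if x == 1 else (1.0 - p)
    return L

# Hypothetical item parameters (a, b) and one respondent's response pattern.
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 0.5), (1.5, 1.0)]
responses = [1, 1, 0, 0]

# Quadrature grid with a standard-normal prior (an assumption of this sketch).
grid = [-4.0 + 0.1 * k for k in range(81)]
prior = [math.exp(-t * t / 2.0) for t in grid]

posterior = [pr * likelihood(t, responses, items) for t, pr in zip(grid, prior)]
eap = sum(t * w for t, w in zip(grid, posterior)) / sum(posterior)
print(f"EAP proficiency estimate: {eap:.3f}")
```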

It is worth noting that this approach was found to be more accurate than maximum likelihood estimation, whose values at the extremes of the ability distribution are less stable.

Reliability of proficiency scores

The reliability of a test can be defined as the degree of consistency between two measures of the same thing. In classical test theory, this is operationally defined as the degree of true-score variation relative to observed-score variation. An analogous extension to the IRT model is the ratio of the population variance minus the measurement error variance to the overall population variance. One difficulty with this notion is that the measurement error in an IRT model is proficiency dependent and varies over the range of proficiency. In addition, the literacy tests described here are adaptive, meaning that there are many possible combinations of items depending upon the proficiency of a given respondent. In order to characterize the consistency of measurement, the expectation of the measurement errors over proficiency was calculated and averaged across the combinations of items used for the full-length tests. This resulted in the following estimates of reliability:

Prose: .925
Document: .882
Quantitative: .883
Health Activities Literacy: .935

In contrast to the full-length test, the purpose of the locator test is to classify the proficiency of respondents into one of three levels on the 0 to 500 point scale, as accurately as possible: performing below 225 (Level 1), between 225 and 275 (Level 2), or above 275 (Level 3 and above). The accuracy of classification depends on the distance of each respondent's proficiency from the nearest cut point. For example, if the proficiency is 50 points away from the cut point, the accuracy is nearly perfect (.98 for all three scales). However, if the proficiency falls just about on either side of a cut point, then the accuracy is close to .50. The statistics below indicate the average accuracy of classifying respondents into these levels for each of the scales, based on the average accuracy calculated at various distances within each of the levels.

Locator classification accuracy
Prose: .880
Document: .881
Quantitative: .882
Health: .931
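Both quantities in this section can be sketched under simple assumptions. In the Python sketch below, a marginal-reliability analogue is computed as one minus the ratio of average error variance to total variance, and classification accuracy at a given distance from a cut point is approximated with a normal measurement-error model. All numeric values are invented; neither the error variance nor the normal-error assumption is taken from the operational analyses.

```python
import math

# Marginal reliability analogue: 1 - E[error variance] / total variance.
# Hypothetical values on a 0-500 reporting scale.
population_sd = 65.0            # proficiency standard deviation
avg_error_var = 280.0           # average squared standard error of measurement
reliability = 1.0 - avg_error_var / population_sd ** 2
print(f"reliability ~ {reliability:.3f}")

# Classification accuracy: the probability that a respondent whose true
# proficiency lies `distance` points from a cut score is classified on the
# correct side, assuming normally distributed measurement error.
def classification_accuracy(distance: float, sem: float) -> float:
    z = distance / sem
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

sem = math.sqrt(avg_error_var)  # roughly 16.7 points here
print(f"50 points from the cut: {classification_accuracy(50.0, sem):.3f}")  # near 1
print(f"at the cut itself:      {classification_accuracy(0.0, sem):.3f}")   # 0.5
```

Averaging such accuracies over the proficiency distribution within each level yields summary figures of the kind reported above.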

Validity of scores

Validity refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores. The validity of the PDQ Profile Series and the Health Activities Literacy Test comes from the various efforts undertaken in the development and conduct of the large-scale national and international assessments. Collectively, these efforts contribute to aspects of construct- and criterion-related validity. These activities include:

A definition of literacy that was drafted by a national panel of experts and adopted by countries for use in the international assessments;
The development of a framework that operationalized this definition, drawing on recent theories of reading and literacy;
The development of literacy tasks that used everyday materials selected by panels of experts and reviewed both nationally and internationally;
The administration of all items to large representative samples of adults covering a wide range of social and economic backgrounds, with checks for bias;
The conduct of item attribute studies to determine the kinds of knowledge and skills associated with successful performance;
The linkage of each literacy scale to extensive background characteristics, so that it is possible to examine the connection between these characteristics and performance on the various literacy scales.

The definition and framework used to construct and understand the literacy scales form the body of information that supports the construct validity of these measures. See, for example, Kirsch, I., The International Adult Literacy Survey: Defining What Was Measured. Princeton, NJ: Educational Testing Service (ETS Research Report RR).

The strong connections observed between these various literacy scales and the set of background characteristics support the criterion-related validity of the measures. This type of information is presented in the reports and publications stemming from the large-scale assessments. A selected set of tables representing these relationships is provided here. The idea is that if the PDQ and HALS tests are measuring aspects of literacy that are important to participation in various aspects of society, such as education and labor, then respondents' performance on the PDQ and HALS should relate to these and other variables collected as part of the survey. All four scales are shown here to have strong positive relationships with respondents' highest level of education.

[Table: Average Literacy Proficiency by Education Level, with columns Prose, Document, Quantitative, and Health; rows run from 0-8 years of schooling through GED/high school, some college, college degrees, and graduate study/degree. Proficiency values were not preserved in this transcription.]

The relationship between respondents' age and proficiency is weaker and curvilinear. Average proficiency is highest among those who are between 35 and 44 years of age. It should be noted that document proficiency appears to have the strongest relationship with age, as evidenced by the sharp decline in proficiency after age 45.

[Table: Average Literacy Proficiency by Age, with columns Prose, Document, Quantitative, and Health; rows are age groups from 16 years old through the oldest ("and older") group. Proficiency values were not preserved in this transcription.]

It was expected that country of birth would be strongly related to English literacy skills, since it is likely to reflect the language one first learns to speak and read. As seen below, the average difference between those who were born in the U.S. and those who were born in another country is about 60 to 70 points.

[Table: Average Literacy Proficiency by Country of Birth, with columns Prose, Document, Quantitative, and Health; rows compare those born in the USA with those born in another country. Proficiency values were not preserved in this transcription.]

Examples of reading practices and literacy skills are also shown below. It is clear that those with higher literacy skills read newspapers far more often than others. There is a very strong indication that respondents who do not read newspapers have very low literacy skills.

[Table: Average Literacy Proficiency by Newspaper Reading Practices, with columns Prose, Document, Quantitative, and Health; rows are reading frequencies: every day, a few times a week, once a week, less than once a week, and never. Proficiency values were not preserved in this transcription.]

The table below shows how literacy proficiency relates to employment status. Respondents who are employed full time show much higher literacy skills than those who are unemployed or out of the labor force.

[Table: Average Proficiency by Labor Force Status, with columns Prose, Document, Quantitative, and Health; rows are full-time employed, part-time employed, unemployed, and out of the labor force. Proficiency values were not preserved in this transcription.]

In addition to employment status, the type of occupation has a strong relationship with literacy proficiency, as shown in the table below. Respondents who are employed as professionals or managers have much higher literacy skills than those who work as laborers.

[Table: Average Proficiency by Most Recent Occupation, with columns Prose, Document, Quantitative, and Health; rows are professional/managerial, sales, craft, and laborer occupations. Proficiency values were not preserved in this transcription.]


Statistics. Nur Hidayanto PSP English Education Dept. SStatistics/Nur Hidayanto PSP/PBI Statistics Nur Hidayanto PSP English Education Dept. RESEARCH STATISTICS WHAT S THE RELATIONSHIP? RESEARCH RESEARCH positivistic Prepositivistic Postpositivistic Data Initial Observation (research Question)

More information

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison Group-Level Diagnosis 1 N.B. Please do not cite or distribute. Multilevel IRT for group-level diagnosis Chanho Park Daniel M. Bolt University of Wisconsin-Madison Paper presented at the annual meeting

More information

1 Introduction. st0020. The Stata Journal (2002) 2, Number 3, pp

1 Introduction. st0020. The Stata Journal (2002) 2, Number 3, pp The Stata Journal (22) 2, Number 3, pp. 28 289 Comparative assessment of three common algorithms for estimating the variance of the area under the nonparametric receiver operating characteristic curve

More information

linking in educational measurement: Taking differential motivation into account 1

linking in educational measurement: Taking differential motivation into account 1 Selecting a data collection design for linking in educational measurement: Taking differential motivation into account 1 Abstract In educational measurement, multiple test forms are often constructed to

More information

Empirical Formula for Creating Error Bars for the Method of Paired Comparison

Empirical Formula for Creating Error Bars for the Method of Paired Comparison Empirical Formula for Creating Error Bars for the Method of Paired Comparison Ethan D. Montag Rochester Institute of Technology Munsell Color Science Laboratory Chester F. Carlson Center for Imaging Science

More information

PERSONALITY ASSESSMENT PROFICIENCY: REPORT REVIEW FORM

PERSONALITY ASSESSMENT PROFICIENCY: REPORT REVIEW FORM PERSONALITY ASSESSMENT PROFICIENCY: REPORT REVIEW FORM Applicant Name: Reviewer Name: Date: I. Please consider each criteria item as either: Met proficiency criterion (Yes, circle 1 point) or Not met proficiency

More information

A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests

A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests David Shin Pearson Educational Measurement May 007 rr0701 Using assessment and research to promote learning Pearson Educational

More information

Title: A new statistical test for trends: establishing the properties of a test for repeated binomial observations on a set of items

Title: A new statistical test for trends: establishing the properties of a test for repeated binomial observations on a set of items Title: A new statistical test for trends: establishing the properties of a test for repeated binomial observations on a set of items Introduction Many studies of therapies with single subjects involve

More information

DRAFT (Final) Concept Paper On choosing appropriate estimands and defining sensitivity analyses in confirmatory clinical trials

DRAFT (Final) Concept Paper On choosing appropriate estimands and defining sensitivity analyses in confirmatory clinical trials DRAFT (Final) Concept Paper On choosing appropriate estimands and defining sensitivity analyses in confirmatory clinical trials EFSPI Comments Page General Priority (H/M/L) Comment The concept to develop

More information

CHAPTER - 6 STATISTICAL ANALYSIS. This chapter discusses inferential statistics, which use sample data to

CHAPTER - 6 STATISTICAL ANALYSIS. This chapter discusses inferential statistics, which use sample data to CHAPTER - 6 STATISTICAL ANALYSIS 6.1 Introduction This chapter discusses inferential statistics, which use sample data to make decisions or inferences about population. Populations are group of interest

More information

SUPPLEMENTARY INFORMATION. Table 1 Patient characteristics Preoperative. language testing

SUPPLEMENTARY INFORMATION. Table 1 Patient characteristics Preoperative. language testing Categorical Speech Representation in the Human Superior Temporal Gyrus Edward F. Chang, Jochem W. Rieger, Keith D. Johnson, Mitchel S. Berger, Nicholas M. Barbaro, Robert T. Knight SUPPLEMENTARY INFORMATION

More information

Multi-Specialty Recruitment Assessment Test Blueprint & Information

Multi-Specialty Recruitment Assessment Test Blueprint & Information Multi-Specialty Recruitment Assessment Test Blueprint & Information 1. Structure of the Multi-Specialty Recruitment Assessment... 2 2. Professional Dilemmas Paper... 3 2.2. Context/Setting... 3 2.3. Target

More information