MEASUREMENT DISTURBANCE EFFECTS ON RASCH FIT STATISTICS AND THE LOGIT RESIDUAL INDEX DISSERTATION. Presented to the Graduate Council of the


MEASUREMENT DISTURBANCE EFFECTS ON RASCH FIT STATISTICS AND THE LOGIT RESIDUAL INDEX

DISSERTATION

Presented to the Graduate Council of the University of North Texas in Partial Fulfillment of the Requirements For the Degree of DOCTOR OF PHILOSOPHY

By Robert E. Mount, A.A., B.S., M.A., C.R.C.

Denton, Texas
August, 1997

Mount, Robert E., Measurement disturbance effects on Rasch fit statistics and the Logit Residual Index. Doctor of Philosophy (Educational Research), August, 1997, 194 pp., 15 tables, 2 illustrations, references, 32 titles.

The effects of random guessing as a measurement disturbance on Rasch fit statistics (unweighted, weighted, and unweighted ability between) and the Logit Residual Index (LRI) were examined through simulated data sets of varying sample sizes, test lengths, and distribution types. Three test lengths (25, 50, and 100), three sample sizes (25, 50, and 100), two item difficulty distributions (normal and uniform), and three levels of guessing (no guessing [0%], 25%, and 50%) were used in the simulations, resulting in 54 experimental conditions. The mean logit person ability for each experiment was +1. Each experimental condition was simulated once in an effort to approximate what could happen on the single administration of a four-option-per-item multiple-choice test to a group of relatively high ability persons.

Previous research has shown that varying item and person parameters have no effect on Rasch fit statistics. Consequently, these parameters were used in the present study to establish realistic test conditions, but were not interpreted as effect factors in determining the results of this study. Rasch fit statistics were found to be robust to varying levels of guessing and to distribution types. The unweighted fit statistic was more sensitive to problems far away from the average ability of the group in which the problems occurred (fit problems away from the item difficulties). The weighted fit statistic was more

sensitive to problems centered on the item difficulties. It was also found that, as the probability of guessing the correct answer increased, low-ability persons tended consistently to guess the correct answer, inducing bias (item familiarity) into the tests. These items were detected by the unweighted between fit statistic. In conditions involving minor problems and misfitting items, the LRI was able to identify group membership of the persons in which the problem occurred. Therefore, it is necessary to use the unweighted, weighted, and between fit statistics in combination with the LRI to diagnose problems for a more accurate assessment of individual differences.


TABLE OF CONTENTS

LIST OF TABLES
LIST OF ILLUSTRATIONS

Chapter

1. INTRODUCTION
   Overview
   Properties of Estimators
   Rasch Estimation Methods
   Rationale for the Study
   Research Question
   Definition of Terms
   Delimitations

2. REVIEW OF THE LITERATURE
   Historical Perspective
   Measurement Disturbances
   Rasch Fit Statistics
   Logit Residual Index
   Purpose of Study

3. METHODS AND PROCEDURES
   Data Set Construction
   Simulated Data Sets
   Rasch Analysis
   Statistical Analysis

4. RESULTS
   Effect of Guessing on Rasch Fit Statistics
   Simulation Design Effects
   Detection of Guessing by Rasch Fit Statistics
   Guessing and the Logit Residual Index

5. FINDINGS AND CONCLUSIONS
   Effect of Guessing on Rasch Fit Statistics
   Detection of Guessing by Rasch Fit Statistics
   Guessing and the Logit Residual Index
   Summary
   Conclusions
   Further Study Recommendations

APPENDIX
   A. IPARM Control File Parameters
   B. A Summary of Item Fit Information by Experiment
   C. A Summary of Misfitting Item Statistics by Experiment

REFERENCES

LIST OF TABLES

Table

1. Definition of Experiments
2. Mean Summary of Item Fit Information for Experiments 1-9 (Normally Distributed Item Difficulty Distributions and No Guessing)
3. Mean Summary of Item Fit Information for Experiments (Normally Distributed Item Difficulty Distributions and a 25% Chance of Guessing Correctly)
4. Summary of Mean Item Fit Information for Experiments (Normally Distributed Item Difficulty Distributions and a 50% Chance of Guessing Correctly)
5. Mean Differences for Experiments With Normally Distributed Item Difficulties With No Guessing, 25%, and a 50% Chance of Guessing Correctly
6. Summary of Mean Item Fit Information for Experiments (Uniformly Distributed Item Difficulty Distributions and No Guessing)
7. Summary of Mean Item Fit Information for Experiments (Uniformly Distributed Item Difficulty Distributions and a 25% Chance of Guessing Correctly)
8. Summary of Mean Item Fit Information for Experiments (Uniformly Distributed Item Difficulty Distributions and a 50% Chance of Guessing Correctly)
9. Mean Differences for Experiments With Uniformly Distributed Item Difficulties With No Guessing, 25%, and a 50% Chance of Guessing Correctly
10. Mean Differences for Experiments With Normal and Uniformly Distributed Item Difficulties and Varying Levels of Guessing (No Guessing, 25%, and 50%)
11. A Comparison of the Frequency of Misfitting Items Detected by Rasch Fit Statistics in a Normal Distribution of Item Difficulties at Varying Test Lengths and Levels of Guessing Using $\chi^2$
12. A Comparison of the Frequency of Misfitting Items Detected by Rasch Fit Statistics in a Uniform Distribution of Item Difficulties at Varying Test Lengths and Levels of Guessing Using $\chi^2$
13. A Comparison of the Frequency of Misfitting Items Detected by Rasch Fit Statistics in Normal and Uniformly Distributed Item Difficulties Using $\chi^2$
14. Number and Percent of Misfitting Items Detected by Rasch Fit Statistics Across Experiments Involving Normal and Uniformly Distributed Item Difficulties
15. Number of Misfitting Items and Number and Percent of LRI Values by Experimental Conditions

LIST OF ILLUSTRATIONS

Figure

1. Rasch Model Notations
2. Table Format for the Display of Experimental Data

CHAPTER 1

INTRODUCTION

Overview

Stevens (1946) defined measurement as the assignment of numerals to objects or events according to rules. However, the measurement of individual differences is not as straightforward as the definition implies; it is fraught with several measurement problems: (a) no one approach is universally acceptable; (b) measurements are based on limited samples of behaviors; (c) measurements are subject to error; and (d) the units of measurement are not well defined (Crocker & Algina, 1986). Of the problems associated with the measurement of individual differences, the accuracy of measurement is probably the most important. A measure can only be as accurate as the ruler (scale) used to obtain the measurement.

Traditionally, the measurement of individual differences was based on the classical true score theory and its associated statistics and scales of measurement (nominal, ordinal, interval, and ratio). The irony is that measurements obtained using this approach were found to be sample specific and that generalizations beyond the reference sample should be made with caution. In addition, the scales of measurement were found to have arbitrary zero points and unequal measurement intervals. Therefore, measurements obtained using these scales may be as arbitrary and unequal as the ruler

used to make them. Given these characteristics, estimates (statistics) obtained in identifying individual differences tended to be biased, inconsistent, insufficient, and inefficient when used with nonnormal distributions. What was needed was a true linear scale that had an absolute zero point and equal interval units of measurement. In 1960, one such scale, with associated statistics, was developed by Rasch (Rasch, 1980). Known as Rasch Analysis, this approach allowed for the independent estimation of person ability and item difficulty parameters. In addition, the statistics used in this approach were found to be consistent, efficient, sufficient, and unbiased.

In the assessment of item function in the Rasch model, one observes the residual trends among ability groups. The Logit Residual Index (LRI), a statistic introduced by Smith (1991b), is a measure of how far an item deviates from the common slope that is fitted for all items (model curve) and the residual trend among ability groups. Thus, items with LRI values greater than zero will have an item characteristic curve (ICC) that is steeper than the modeled curve, and items with values less than zero will have an ICC that is flatter than the modeled curve. The residual trend is an indication of how well, or less well, different ability groups performed on an individual item. The purpose of this investigation is to test the effects of varying test parameters and levels of measurement disturbance on Rasch fit statistics and the Logit Residual Index.

Properties of Estimators

Rarely, if ever, are characteristics about a complete population known; therefore, researchers make inferences about a population based on a representative random sample taken from the population. A population refers to all members in the entire group having some common characteristic. For example, a population may include all members in a classroom, school, city, community, county, state, nation, or the world. As group membership increases, it becomes increasingly difficult to obtain measures on all characteristics, either because of restrictions in time and cost or because the population size increases too rapidly. A population may be finite (a known or countable number of members) or infinite (a population so large that group membership is not known). Characteristics about a population are called parameters, and characteristics about a sample are called statistics.

The most commonly used estimates about populations are measures of central tendency (mean, median, and mode). They identify the most typical measures in a normal distribution. The arithmetic mean is the most commonly used measure of central tendency. It is a simple arithmetic average determined by summing all scores in a distribution of scores and dividing by the number of scores. The mean is an appropriate measure of central tendency when the score distribution is normal and the level of measurement is on the interval or ratio scale. The median is the midpoint in a distribution of scores when arranged in order of magnitude. Stated differently, it is the point on the score scale below which 50% of the scores fall. It is also equivalent to the 50th percentile. The median is

an appropriate measure of central tendency when the level of measurement is on the ordinal scale and the score distributions are other than normal. The mode is the most frequently occurring score in a distribution of scores. However, it is not a dependable measure of central location. Depending on the shape of the distribution, it is possible to have two modes (bimodal) or more (multimodal). The mode is an appropriate measure of central tendency when the level of measurement is on the nominal scale. In a normally distributed population of scores, the mean, median, and mode will coincide or be the same value. However, in nonnormal distributions, these values differ, and, therefore, certain estimators have more desirable properties than others.

The desirable properties of estimators are consistency, efficiency, sufficiency, and unbiasedness. These serve as criteria for determining preferences for one method of estimation over another. An estimator is considered to be unbiased when the mean of a sampling distribution of means approaches that of the population parameter as the number of samples of a given size increases. That is, a statistic is unbiased when it shows no systematic tendency to be either greater than or less than the population parameter. For example, it can be shown that the variance $S^2 = \sum (X - \bar{X})^2 / n$ is a biased estimate of the population variance ($\sigma^2$) (Ferguson, 1981).
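The bias of the divisor-n variance can be illustrated with a small simulation. The sketch below is not part of the original study: the function names, the standard normal population, the sample size of 5, and the fixed seed are all illustrative assumptions.

```python
import random


def variance_divisor_n(scores):
    """Sample variance with divisor n (the biased form discussed in the text)."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores) / len(scores)


def variance_divisor_n_minus_1(scores):
    """Sample variance with divisor n - 1 (the usual unbiased form)."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores) / (len(scores) - 1)


def average_estimate(estimator, sample_size, replications, seed=0):
    """Average an estimator over repeated samples from a N(0, 1) population.

    The population (variance 1.0) and the seed are assumptions chosen
    only to make the illustration reproducible.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(replications):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        total += estimator(sample)
    return total / replications


if __name__ == "__main__":
    # With n = 5 the divisor-n estimator averages about (n - 1)/n = 0.8
    # of the population variance, while the divisor-(n - 1) form averages
    # close to 1.0.
    print(average_estimate(variance_divisor_n, 5, 20000))
    print(average_estimate(variance_divisor_n_minus_1, 5, 20000))
```

Because both estimators are run on the same seeded samples, the first average is (n - 1)/n times the second, which makes the systematic underestimate visible.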

Consistency implies that an estimator more closely approximates a population parameter as sample size increases. The arithmetic mean is a prime example of a consistent estimator. It more closely approximates the population parameter as the sample size increases. Efficiency is implied by sampling variance. It refers to the variability of estimates from sample to sample, or the degree of sampling error associated with the estimator. That is, if the sampling error is less than the sampling error associated with any other method of estimation, the estimate is considered to be efficient (Ferguson, 1981). More explicitly, the estimator that has the smallest standard error is more efficient (Walker & Lev, 1953).

An estimator is sufficient for estimating a population parameter if it exhausts all the information about the population parameter from sample data. For example, the mean is a sufficient estimator of the population mean ($\mu$), because, once the sample mean is known, any other statistic computed from the sample data (such as the median or mode) would provide no further information about the population mean (Neter, Wasserman, & Whitmore, 1978). Given a normal population, the mean provides an estimate of $\mu$ that satisfies all the desirable properties of a good estimator (consistent, efficient, sufficient, and unbiased) (Walker & Lev, 1953).

Rasch Estimation Methods

Measurement implies the determination of the quantity, quality, or some other characteristic of an object or attribute. It answers the questions concerning how many,

how often, and how much of a particular object or attribute exist. The process involves the assignment of units or numbers in a logical fashion along a dimension or scale. When an object or attribute is measured, it is assigned a specific position along a dimension or numerical scale. Traditionally, we have used four scales or levels of measurement: (a) the nominal scale, in which numbers, names, or words are used to identify or label individuals or objects; (b) the ordinal scale, in which numbers or words reflect the order of things; (c) the interval scale, which has equal units of measurement and an arbitrary zero point; and (d) the ratio scale, which has equal units of measurement and a true or absolute zero point. Each scale of measurement has its own rules and makes different assumptions about the measurement process. These scales are prevalent in the traditional classical true score approach to the measurement of human traits.

The Rasch measurement model, unlike the classical true score model, attempts to explain the effect of a person's ability on item performance. The Rasch model frees the estimation of a person's ability from the item difficulty, and the estimation of the item difficulty is freed from the person's ability. In short, the more able the person, the better the chances for success on an item, and the easier the item, the more likely a person is to solve it (Wright & Stone, 1979). It has been shown that no other mathematical model allows the estimation of person ability measures ($\beta_v$) and item difficulty calibrations ($\delta_i$) independent of each other (Anderson, 1973; Barndorff-Nielsen, 1978; Rasch, 1961; Wright & Stone, 1979). The logistic function (probability of a correct response) in the Rasch model,

$$P\{x_{vi} = 1 \mid \beta_v, \delta_i\} = \frac{\exp(\beta_v - \delta_i)}{1 + \exp(\beta_v - \delta_i)},$$

provides both linearity of scale and generality of measure (Wright & Stone, 1979). Rasch called this particular characteristic "specific objectivity." The symbols and associated definitions used in the Rasch model are presented in Figure 1.

  $\beta$       ability level
  $\delta$      difficulty level
  $r_v$         test score of person $v$
  $L$           the number of items in the test
  $H$           the average difficulty level of the test
  $M$           mean person ability
  $\omega^2$    the variance in difficulties of the test items
  $i$           an item on the test
  $p_i$         sample p-value of an item $i$
  $v$           person
  $\beta_v$     ability level of person $v$
  $r$           score on the test
  $\delta_i$    difficulty level of individual item $i$
  $x_{vi}$      person response

Figure 1. Rasch model notations.

Several Rasch measurement models have been identified: (a) rating scale, (b) Poisson, (c) binomial, (d) dichotomous, (e) partial credit, and (f) many faceted. However, for a measurement model to work, there must be some method of estimating its parameters. Rasch identified six estimation methods: (a) the LOG method, (b) the PAIR method, (c) the FCON method, (d) the UCON method, (e) the PROX method, and (f) the UFORM method. Of the six estimation methods, PROX is the only estimation procedure in which item and person parameters can be easily calculated by hand.
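The logistic function and the hand-calculable character of PROX can be sketched in a few lines. The code below is an illustration only, not the software used in the study; the function and parameter names are hypothetical, and the PROX expressions follow the normal approximation form attributed to Wright and Stone (1979) in this section, with 2.89 (that is, 1.7 squared) as the scaling constant.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def prox_item_calibration(p_value, mean_ability, ability_variance):
    """PROX item difficulty from the sample p-value of an item.

    p_value is the proportion of persons answering correctly; it must lie
    strictly between 0 and 1, otherwise the logit is undefined.
    """
    if not 0.0 < p_value < 1.0:
        raise ValueError("p_value must be strictly between 0 and 1")
    expansion = math.sqrt(1.0 + ability_variance / 2.89)
    return mean_ability + expansion * math.log((1.0 - p_value) / p_value)


def prox_person_measure(raw_score, test_length, test_height, difficulty_variance):
    """PROX person ability from a raw score on a test of test_length items.

    Zero and perfect scores are rejected because their logits are infinite.
    """
    if not 0 < raw_score < test_length:
        raise ValueError("raw_score must be between 1 and test_length - 1")
    expansion = math.sqrt(1.0 + difficulty_variance / 2.89)
    return test_height + expansion * math.log(raw_score / (test_length - raw_score))


if __name__ == "__main__":
    # A person one logit above an item succeeds about 73% of the time.
    print(rasch_probability(1.0, 0.0))
    # A raw score of 20 out of 25 on a test centered at 0 logits.
    print(prox_person_measure(20, 25, 0.0, 1.0))
```

Rejecting zero and perfect scores mirrors the usual practice of excluding such records from calibration, since no finite logit corresponds to them.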

PROX is a normal approximation estimation procedure that expresses item difficulty calibrations and person ability measures on a common linear scale. This measurement unit is called a logit. The procedure assumes that (a) person abilities ($\beta_v$) are more or less normally distributed with mean ($\mu$) and standard deviation ($\sigma$), and (b) item difficulties ($\delta_i$) are more or less normally distributed with average difficulty ($H$) and standard deviation ($\omega$). Consequently, the effects of the sample on item difficulty calibrations and that of test length on person ability measures can be summarized by means and standard deviations on the variable being measured (Wright & Masters, 1982). The PROX estimation procedure frees the scores from the effects of sample size and test length, then transforms them into a logit measure (Wright & Stone, 1979).

The PROX estimation of a person's ability can be found without iteration as

$$b_v = H + (1 + \omega^2/2.89)^{1/2} \ln[r_v/(L - r_v)],$$

with a standard error of

$$SE(b_v) = (1 + \omega^2/2.89)^{1/2} \, [L / (r_v (L - r_v))]^{1/2},$$

a test height of

$$H = \sum_{i=1}^{L} d_i / L,$$

and a variance estimate of

$$\omega^2 = \Big(\sum_{i=1}^{L} d_i^2 - L H^2\Big) / (L - 1).$$

Item difficulty $d_i$ can be found as

$$d_i = M + (1 + \sigma^2/2.89)^{1/2} \ln[(1 - p_i)/p_i],$$

where $M$ is the mean person ability, with a standard error of

$$SE(d_i) = (1 + \sigma^2/2.89)^{1/2} \, [1 / (N p_i (1 - p_i))]^{1/2}.$$

The Rasch model uses the logit function, $\ln[(1 - p_i)/p_i]$, to transform the item p-value into a linear equal interval scale. In the Rasch model, $p_i$ is calculated as $p_i = s_i/N$, where $s_i$ is the number of satisfactory responses (correct answers) and $N$ the number of persons. The PROX estimation method is most appropriately used for calibrating new items, because item difficulties among a sample of new items tend to approximate a normal distribution, and a sample of persons tends to be normally distributed (Wright & Stone, 1979).

It has been found that Rasch estimation methods are unbiased, consistent, efficient, and sufficient (Anderson, 1973; Andrich, 1988; Haberman, 1977; Wright, 1977; Wright & Stone, 1979). Therefore, the Rasch estimation methods are preferred over those of the traditional classical true score model. In the traditional classical true score approach, a person's ability is based on a test score, usually expressed as items correct, a percentage of 100, or a percentile rank. The test score has been shown to

reflect ordinal and curvilinear characteristics which are not conducive to meaningful interpretation.

Rationale for Study

The measurement of individual differences based on the classical true score approach and its associated statistics has been found to be biased, inefficient, inconsistent, and insufficient, especially with nonnormal distributions. The most often used scales of measurement (nominal, ordinal, and interval) have arbitrary zero points and unequal measurement intervals, and the results are sample specific. In addition, a single error term (standard error of measurement) is used for all examinees (Allen & Yen, 1979). Consequently, there is no way to identify specific objectivity among items and persons (independence of person ability and item difficulties) using the classical true score approach.

Test data are useful only if there is some correspondence between the items on the test and the latent trait being measured. In addition, the data should fit the measurement model used in constructing the test. In the classical true score approach, chi-square ($\chi^2$) and the point biserial correlation are used as indexes of goodness of fit. Chi-square is used to test the difference between observed and expected events, and the point biserial correlation is sometimes used as an index of fit for items on a test. The problem with $\chi^2$ as a fit index is that there are different sampling distributions as the degrees of freedom change. With the point biserial correlation, it is unclear what magnitude of value is needed to establish an

acceptable fit, and it is sensitive to the score distribution of the sample.

The Rasch model has overcome many of the problems associated with the classical true score approach:

1. A standard error of measurement is provided for each examinee and item.
2. The standard error of measurement can be tested for significance.
3. The measurement scale has an absolute zero point and equal interval units.
4. The item and person parameters are independent.
5. The parameter estimators are unbiased, consistent, efficient, and sufficient.

Given these advantages, the Rasch model provides a more accurate estimate of a person's ability than does the classical true score approach (Allen & Yen, 1979; Wright & Stone, 1979), and it allows for the independent diagnosis of measurement problems associated with items, persons, and item-by-person interactions.

Although several research studies have shown the effects of guessing on Rasch fit statistics, no studies have used the Logit Residual Index (LRI) in conjunction with fit statistics in helping to identify the effects of measurement disturbances. The purpose of this Monte Carlo simulation is to investigate the effects of varying test parameters and levels of measurement disturbance on Rasch fit statistics and the Logit Residual Index in the detection of misfitting items in the single administration of a four-option-per-item multiple-choice test as experienced in a classroom situation. Rasch fit statistics are applied to simulated data sets of varying sample sizes, test lengths, item difficulty distributions, and levels of measurement disturbance.
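A Monte Carlo data set of this general kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: the guessing mechanism (a lower bound on the success probability, so that the observed probability is g + (1 - g)P), the function names, and the seed are all hypothetical choices made for the example.

```python
import math
import random


def simulate_responses(abilities, difficulties, guess_prob, seed=0):
    """Simulate a persons-by-items matrix of 0/1 responses.

    Each response is drawn from the Rasch probability P, raised by guessing
    to guess_prob + (1 - guess_prob) * P. This mechanism is an assumption
    made for illustration only.
    """
    if not 0.0 <= guess_prob <= 1.0:
        raise ValueError("guess_prob must be between 0 and 1")
    rng = random.Random(seed)
    matrix = []
    for ability in abilities:
        row = []
        for difficulty in difficulties:
            p_model = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
            p_observed = guess_prob + (1.0 - guess_prob) * p_model
            row.append(1 if rng.random() < p_observed else 0)
        matrix.append(row)
    return matrix


def total_correct(matrix):
    """Count the correct responses in a simulated response matrix."""
    return sum(sum(row) for row in matrix)


if __name__ == "__main__":
    rng = random.Random(1)
    # 25 persons with mean ability +1 logit (the unit standard deviation
    # is assumed) and 25 items with normally distributed difficulties.
    abilities = [rng.gauss(1.0, 1.0) for _ in range(25)]
    difficulties = [rng.gauss(0.0, 1.0) for _ in range(25)]
    for guess in (0.0, 0.25, 0.50):
        data = simulate_responses(abilities, difficulties, guess, seed=2)
        print(guess, total_correct(data))
```

Reusing the same seed across guessing levels keeps the random draws fixed, so any change in the number of correct responses is attributable to the guessing level alone.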

Research Questions

Although Rasch estimation models possess the desirable characteristics or properties of estimators, what effects do measurement disturbances have on Rasch fit statistics and the LRI when varying test lengths, sample sizes, item difficulty distributions, and levels of guessing as a measurement disturbance? To test the effects of guessing and varying test parameters on Rasch fit statistics and the LRI, the following research question is addressed:

1. What effect do medium and high levels of guessing have on Rasch fit statistics and the LRI when varying sample sizes, test lengths, and item difficulty distributions?

Definition of Terms

Chi square ($\chi^2$): a descriptive measure of the magnitude of the discrepancies between the observed and expected frequencies.
Consistency: an estimate more closely approximates the population parameter as the sample size increases.
Estimator: a statistic used to determine some characteristic about a population or sample.
Efficiency: the sampling error associated with a given estimator is less than the sampling error associated with any other method of estimation.
Fit: the degree to which measurement data approximate the assumptions or characteristics of a particular measurement model.
Latent Trait: an ability or characteristic possessed by an individual that cannot be directly observed.
Infit: the weighted fit statistic.
Logit: a unit of measurement used in the Rasch model.

Logit Residual Index: the sum of the chi-squares for an individual item. Provides an indication of variations (steepness or flatness) in the item characteristic curve.
Measurement: the assignment of numerals to objects or events according to rules.
Measurement Disturbances: conditions that interfere with the measurement of some underlying psychological construct (aptitude, ability, or attitude).
Outfit: the unweighted fit statistic.
Overfit: items with negative fit statistics and steeper observed item characteristic curves than predicted.
Parameters: characteristics about a population.
Person ability: the amount of a specific trait possessed by an individual that enables that person to answer a test question correctly.
Plodding: to work slowly on a test and run out of time before attempting all items.
Point biserial correlation: the correlation between a continuous variable and a dichotomous variable (correlation between an item score and the test score).
Population: all members in an entire group having some common characteristic.
PROX: a normal approximation estimation procedure that expresses item difficulty calibrations and person ability measures on a common linear scale.
Start-up: reduced performances at the beginning of a test due to unfamiliarity, anxiety, and so on.
Statistics: characteristics about a sample.
Unbiased: a statistic shows no systematic tendency to be either greater than or less than the population parameter.
Underfit: items with positive fit statistics and flatter observed item characteristic curves than predicted.
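The distinction between the weighted (infit) and unweighted (outfit) statistics defined above can be made concrete with their commonly used mean-square forms. The sketch below is an assumption-laden illustration: it computes unstandardized mean squares for a single item, whereas the negative and positive values mentioned in the definitions refer to standardized versions of these statistics, and the function name is hypothetical.

```python
def item_fit_mean_squares(responses, probabilities):
    """Return (infit, outfit) mean squares for one item.

    responses:     0/1 responses of each person to the item
    probabilities: model probabilities of success for the same persons

    Outfit averages the squared standardized residuals, so unexpected
    responses from persons far from the item difficulty weigh heavily.
    Infit weights each squared residual by its model variance, so it is
    dominated by persons whose ability is near the item difficulty.
    """
    if len(responses) != len(probabilities) or not responses:
        raise ValueError("responses and probabilities must be non-empty and equal in length")
    if any(not 0.0 < p < 1.0 for p in probabilities):
        raise ValueError("probabilities must be strictly between 0 and 1")
    squared_residuals = [(x - p) ** 2 for x, p in zip(responses, probabilities)]
    variances = [p * (1.0 - p) for p in probabilities]
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(responses)
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit


if __name__ == "__main__":
    # One surprising failure by an able person (p = 0.95) inflates outfit
    # much more than infit.
    print(item_fit_mean_squares([1, 0, 1, 0], [0.6, 0.95, 0.5, 0.4]))
```

Both mean squares have an expected value near 1 when the data fit the model; values well above 1 suggest noisy (underfitting) responses and values well below 1 suggest overly predictable (overfitting) responses.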

Delimitations

For this study, specific parameters related to test lengths, sample sizes (examinees), and item difficulty distributions were selected based on a review of previous research. The study is limited to three sample sizes (25, 50, and 100), three test lengths (25, 50, and 100), three levels of guessing (no guessing [0%], 25%, and 50%), and two item difficulty distributions (normal and uniform). In addition, the results are based on experimental conditions that simulate the single administration of four-option-per-item multiple-choice tests as experienced in classroom situations.
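The fully crossed design described above can be enumerated directly, which confirms the count of 54 experimental conditions (2 x 3 x 3 x 3). The sketch is illustrative: the dictionary keys and the numbering order are assumptions, chosen only so that the first nine conditions are the normal-distribution, no-guessing cases; the ordering within each block of nine is not taken from the study.

```python
from itertools import product

DIFFICULTY_DISTRIBUTIONS = ("normal", "uniform")
GUESSING_LEVELS = (0.0, 0.25, 0.50)
TEST_LENGTHS = (25, 50, 100)
SAMPLE_SIZES = (25, 50, 100)


def build_conditions():
    """Enumerate every combination of the four design factors."""
    conditions = []
    combos = product(DIFFICULTY_DISTRIBUTIONS, GUESSING_LEVELS, TEST_LENGTHS, SAMPLE_SIZES)
    for number, (distribution, guessing, test_length, sample_size) in enumerate(combos, start=1):
        conditions.append(
            {
                "experiment": number,  # illustrative numbering only
                "distribution": distribution,
                "guessing": guessing,
                "test_length": test_length,
                "sample_size": sample_size,
            }
        )
    return conditions


if __name__ == "__main__":
    design = build_conditions()
    print(len(design))
    print(design[0])
```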

CHAPTER 2

REVIEW OF THE LITERATURE

Historical Perspective

"Thorndike (1918) said, whatever exists at all exists in some amount. To know it thoroughly involves knowing its quantity as well as its quality" (Crocker & Algina, 1986, p. 4). Psychological constructs, however, are hypothetical abstractions that can be observed only indirectly and the existence of which can never be fully confirmed. Stevens (1946) defined measurement as the assignment of numerals to objects or events according to rules; hence, the use of nominal, ordinal, interval, and ratio scales of measurement. Lord and Novick (1968) and Torgerson (1958) noted that measurement applied to the properties of objects rather than the objects themselves. Accordingly, when we measure an individual or object, we are measuring not the object or person, but rather the properties that define the construct or variable possessed by the person whose performance is being measured. Thus, the measurement of such abstractions presents several problems (Crocker & Algina, 1986).

To empirically investigate the existence of a trait or property, it is necessary to develop a test theory to guide the investigation. Based upon test theory, we develop tests, the primary tools by which we collect information about individual differences. However, before any measurement can be made, an operational definition of the variable

of interest must be established. In other words, we must establish some correspondence between the test items and the construct being measured. This correspondence is known as establishing an operational definition. In the literature, this is sometimes referred to as item-objective congruence (Crocker & Algina, 1986). The operational definition or common line of inquiry allows the test to define the variable being measured and provide a means for estimating the location of the person taking the test along an ability continuum based on his/her test score (Wright & Stone, 1979).

Test scores are meaningful only if (a) they relate to some scale of measurement; (b) they are generalizable beyond the test; and (c) the response pattern is consistent with expectations. Binet, Thurstone, Thorndike, Stevens, and others were among the first to develop scales of measurement (Crocker & Algina, 1986). These scales provided rules and meaningful units of measurement for the comparison of individual items and persons in the assessment of individual differences. These scales are predominantly used in the assessment of individual differences in the classical true score approach to measurement.

In the classical true score approach, a person's observed score ($X$) is the sum of two unobservable scores, a true score ($T$) and an error score ($E$). The observed score is defined as $X = T + E$. The true score is defined as the average score resulting from an infinite number of repeated testings with the same instrument. The error score is the difference between the observed score and the true score. Measurements obtained using this approach are

usually expressed as correlations, percentile ranks, z-scores, t-scores, or scaled scores. It has also been shown that these statistics are sample specific. The disadvantages of the classical true score approach are that a single error term (standard error of measurement) is used for all examinees and that the item difficulties are related to the number of persons who answered the items correctly. It has also been shown that, as the reference group changes, so does the measured performance of the person taking the test. To what degree of certainty, then, are these measurements valid for generalizations beyond the reference sample? What was needed was a test theory or model that allowed for the separate and independent estimation of both item and person parameters, something the classical true score approach did not take into account. One such model was introduced by Rasch in 1960 along with a true linear scale of measurement (Rasch, 1980).

When using the Rasch model, generalizations beyond the test are based on several assumptions:

1. The test theory used in the development of the test is appropriate.
2. The items on the test define the variable being measured.
3. The test score gives us some indication of the properties that define the variable that is possessed by the person taking the test.
4. The scale of measurement is linear.
5. The item and person parameters are independent.

Thus, when the Rasch model is fitted properly, the criteria of independence of sample and items are satisfied, and generalizations beyond the test can be sufficiently made.

Measurement Disturbances

Measurement disturbances are conditions that interfere with the measurement of some underlying psychological construct, such as aptitude, ability, or attitude (Smith, 1991b). With respect to the Rasch model, only two conditions determine the outcome of the interaction between the person and any item on the test: (a) the amount of the trait possessed by the person and (b) the amount of the trait necessary to provide a certain response to a given stimulus (Smith, 1991b). These conditions are commonly referred to as person ability and item difficulty. Any other condition that influences measurement is considered noise in the measurement process.

Measurement disturbances that are characteristics of the person and independent of the items include, but are not limited to, (a) start-up, (b) plodding, (c) cheating, (d) illness, (e) external distractions, (f) guessing, (g) boredom, and (h) fatigue (Smith, 1991b). Measurement disturbances associated with the interaction of the person and the properties of the items are (a) guessing, (b) sloppiness, (c) item content, (d) item type, and (e) item bias. Examples of the third type of measurement disturbance may include such things as typographical errors, unrelated answer choices, and items unrelated to content. The most common types of measurement disturbances are cheating and guessing (Smith, 1991b).

Thorndike (1949) developed a list of possible disturbances to the measurement process. Smith (1985) later classified measurement disturbances into three general categories: (a) disturbances that are the results of characteristics of the person that are

independent of the items, (b) disturbances that are the interaction between the characteristics of the person and the properties of the items, and (c) disturbances that are the results of the properties of the items that are independent of the characteristics of the person. The classification of measurement disturbances is important in that the source of a measurement disturbance dictates the techniques necessary to detect its presence (Smith, 1991b). Glaser (1949, 1952) and Mosier (1941) felt that a person would exhibit consistently correct answers to relatively easy items, consistently incorrect responses to difficult items, and inconsistent responses to items centered on their ability level. Since inconsistent responses could be associated with measurement disturbances, Thurstone and Chave (1929) believed that some criterion should be established such that inconsistent records should be eliminated from the tabulation. Thus, persons with perfect scores (all correct) and persons with no items correct (score of zero) are eliminated from Rasch item and person analysis. The detection of measurement disturbances can be divided into two general categories: an investigation of the structure of the entire response matrix (an investigation of the fit of the responses to individual items [item fit]), and an investigation of the fit of the responses for an individual person (person fit) (Smith, 1991b). Once a measurement disturbance has been detected, there are four possible responses: (a) ignore the problem, (b) assume everyone has the problem and make a correction, (c) use some method of robust estimation, or (d) use the available information about the items and the people to

make a systematic analysis of each individual's response patterns. If a measurement disturbance is noticed with person fit analysis, there are four possible alternative actions: (a) accept the original estimate, (b) modify the response pattern and re-estimate ability, (c) report only subset ability estimates and no total ability, or (d) decide that there is not enough information in the responses to report any ability estimate (Smith, 1991b). An analysis of fit for the entire response matrix does not require additional information about the items or persons. It can be based solely on the observed responses, but it is more useful when based on some characteristic of the persons (age, gender, native language, or ethnic origin). These characteristics can be used to create subgroups of persons that can be used to test the invariance of the item difficulty parameters. Person fit analysis is more useful when based on groups of items that have the potential to evoke measurement noise in certain groups of persons. However, there are some measurement disturbances that cannot be easily identified in either items or persons (Smith, 1991b). In 1982 Smith compared the weighted and unweighted between fit statistics as applied to persons and found that (a) the mean and the standard deviation of the two statistics were almost identical; (b) the correlation between the two statistics was very high (r = .99); (c) the Type I error rates were almost identical; and (d) there was high correspondence between items and persons identified as misfitting by the two statistics (Smith, 1991b). Anderson (1973), Gustafsson (1980), and Wollenberg (1982) suggested the use of the likelihood ratio chi-square test as an alternative to the between fit statistic because the

distributional properties of the Pearson chi-square are not known. Smith and Hedges (1982) demonstrated that the distributions of the Pearson chi-square and the likelihood ratio chi-square were almost identical in data simulated to fit the Rasch model and data designed to simulate various forms of measurement disturbances. Smith (1991a) examined the effects of test length, number of persons, item difficulty distributions, person abilities, and the number of steps in each item on the fit mean squares. The results suggested that (a) responses are discrete rather than continuous variables and have little influence on the distribution of the fit statistics; and (b) estimated item and person parameters appear to have little effect on the mean of the fit statistics, but seem to reduce the standard deviation. Further simulations by Smith (1991a) studied the effects of test length, number of persons, range of item difficulties, and offset between person ability and item difficulty distributions. The results suggested correction factors for the restrictions imposed by the use of both estimated item difficulties and person abilities in the analysis. Because there was a magnitude of difference between the weighted and unweighted versions of the total fit statistics, two correction factors were proposed. Smith (1988b) performed several simulations to assess the distributional properties of the weighted and unweighted between fit statistics. These simulations involved 10 replications of 1,000 persons taking a 20-item test, with the item difficulties uniformly distributed from -1 to +1 logits. The results showed that, as the number of ability groups increased, the mean and standard deviation of the transformed values

approached the hypothesized values of 0 and 1. Additional simulations studied the effect of increasing the number of persons and number of items, varying the dispersion of item difficulties, and varying the offset between the mean of the item and the person distributions. The results indicated that, within the ranges studied, varying these factors had little effect on the distribution of the transformed values. Thus, there appears to be no reason to develop correction factors such as those developed for the weighted and unweighted total fit statistics to correct for the influence of these factors on the distribution and Type I error rate of the between fit statistics (Smith, 1991a). Smith (1988a, 1991a) also studied the power of the total and between fit statistics to detect two types of measurement disturbances, item bias and guessing, when unknown. These studies found that the weighted, unweighted, and between fit statistics were capable of detecting different types of measurement disturbances. The between fit statistic was more efficient at detecting item bias than either the unweighted or weighted total fit statistic. The unweighted and weighted total fit statistics were more sensitive to disturbances such as guessing and start-up. The primary difference between the two total fit statistics is that the unweighted version is based on the sum of the standardized residuals, whereas the weighted version is based on the sum of the standardized residuals that have been weighted by the information function. For items far away from the person's estimated ability, the weighting process makes the weighted statistic less sensitive to residuals from those items. Systematic identification and evaluation of measurement disturbances were also demonstrated by Wright (1977), Wright and Stone

(1979), and Wright and Masters (1982). Unless one is looking for a specific type of measurement disturbance, it seems necessary to use both the total and between fit statistics in the analysis of information. Smith (1986, 1988a, 1991b) and Smith and Hedges (1982) studied the power of the total and between fit statistics to detect two types of measurement disturbances, item bias and guessing. These studies found that the total and between fit statistics were capable of detecting different types of measurement disturbances. The between fit statistic was more efficient at detecting item bias, and the total fit statistics were more sensitive in detecting disturbances such as guessing and start-up.

Rasch Fit Statistics

Prior to the development of computers, the calculation of Rasch fit indexes was not practical. In fact, the first fit statistic developed for use with Rasch calibration computer programs was the overall chi-square statistic (Smith, 1991b). This statistic was based on the Pearson chi-square typically used in the fit statistics developed by Wright (1977). The overall chi-square statistic was developed to be used with dichotomously scored test items to assess the fit of the entire data matrix to the Rasch measurement model rather than assessing the fit of individual items or persons (Smith, 1991b). The overall chi-square,

\[ \chi^2 = \sum_{i=1}^{L} \sum_{j=1}^{m} y_{ij}^2, \]

is formed by summing a version of the squared standardized residual for the entire matrix, where $m$ is the number of raw score groups ($L - 1$) and $L$ is the number of items on the test, with $(L - 1)(m - 1)$ degrees of freedom (Smith, 1991b; Wright & Panchapakesan, 1969). The standardized residual is defined as

\[ y_{ij} = \frac{a_{ij} - r_j p_{ij}}{\left[ r_j p_{ij} (1 - p_{ij}) \right]^{1/2}}, \]

where $a_{ij}$ is the observed number of correct responses to item $i$ for persons with raw score $j$, $r_j$ is the number of persons with raw score $j$, and $p_{ij}$ is the probability of a correct response on that item for group $j$ (Smith, 1991b, p. 165). Anderson (1973) developed an additional index of overall fit based on the likelihood-ratio chi-square. Wright and Panchapakesan (1969) also developed a statistic known as the item chi-square, which can be used to test the fit of responses to individual items. These statistics are referred to as between ability group fit statistics. Traditionally, the point biserial correlation offered a rough estimate of fit; however, this statistic is sample specific. That is, it is dependent upon the score distribution of the sample. Anderson (1973) and Barndorff-Nielsen (1978) have shown that only item difficulty is necessary for consistent and sufficient estimates. Rasch suggested several statistics to assess the fit of data to his measurement models, the weighted and the unweighted total fit statistics. The weighted statistic is referred to as infit and the unweighted statistic is referred to as outfit. In the weighted version, a greater weight is given to unexpected responses to

items near the person's logit measure (ability), and in the unweighted version, a greater weight is given to unexpected responses that are farther away from the person's logit measure (Wright & Stone, 1979). The total fit statistics evaluate the general agreement between the variable defined by the item and the variable defined by all other items over the whole sample. The weighted total fit statistic was developed to diminish the effect of anomalous outliers, which result in unusually large mean squares (Smith, 1991b). This is evident when an unexpected number of low-ability persons answer difficult items correctly and an unexpected number of high-ability persons answer easy items incorrectly at the beginning of the test. These statistics are sensitive to measurement disturbances, such as guessing, start-up, highly discriminating items, and very easy items, but are relatively insensitive to systematic disturbances such as item bias (Smith, 1991b). BICAL, a computer program used to test fit, was developed by Wright and Mead (1978). This program uses the unweighted versions of two statistics, the total and between fit statistics. The between fit statistic is based on the number of ability groups, and the total fit statistic is based on the item/person residual rather than the item/ability level residual (Smith, 1991b). Later versions of BICAL introduced a log transformation in an attempt to standardize the fit statistics to an approximate unit normal distribution. These transformations were introduced because the mean squares that indicated possible misfit varied from item to item and analysis to analysis, depending on

the number of persons, item difficulty distributions, and the distribution of person abilities (Smith, 1991b). The latest version of BICAL uses a cube root transformation to convert the mean squares of the unweighted total and between fit statistics into approximate unit normals (Smith, 1991b). However, these statistics are sensitive to start-up, guessing, large ranges of item difficulties, person abilities, and easy items, thereby producing large mean squares (misfit). This latest version also introduces the weighted version of the total fit statistic, which replaced the unweighted version. Wright and Masters (1982) expanded the notion of fit to two polychotomous Rasch models, the rating scale and partial credit models. With this addition, Rasch fit statistics are now available for models with other than dichotomously scored items. MSCALE, CREDIT, BIGSCALE, and BIGSTEPS are among the most recent Rasch calibration programs. The primary purpose of these programs is to estimate item and person parameters from a collection of responses to the items (Smith, 1991b). These programs contain both the unweighted and weighted total fit statistics. These two statistics accentuate different parts of the item-person relationship. Although there is a high correlation between the two statistics, the difference between the two can help diagnose different types of measurement disturbances. The total fit statistics are more sensitive to measurement disturbances such as guessing, where unusual numbers of low-ability examinees give correct answers to difficult items, and start-up, where an unusually high number of high-ability examinees give incorrect responses to easy items

at the beginning of the test. The total fit statistics are also sensitive to variation in the item characteristic curve. IPARM (item and person analysis with the Rasch model) is an item analysis and person analysis computer program for dichotomous and rating scale data. The major advantage of IPARM is that it constructs between fit statistics based on characteristics of the person for item analysis or properties of the items for person analysis. It is the only software program that provides between fit statistics (unweighted version) for biographical subpopulations (Smith, 1991b). When biographical data are used in the analysis, it is a direct test of the invariance of the estimation of the item difficulty parameter over ability groups (Smith, 1991b). When demographic data are used in the analysis to create subgroups (sex, race, and age), the resulting statistic will give an indication of the presence of item bias, or differential familiarity, in response patterns for the items (Smith, 1991b). In item analysis with IPARM, the software first calculates the mean squares associated with the Rasch fit statistics and then converts them to their associated fit statistic with a cube root transformation. The unweighted total fit mean square ($UMS_i$) is defined as

\[ UMS_i = \frac{1}{N} \sum_{n=1}^{N} \frac{(X_n - P_n)^2}{P_n (1 - P_n)}, \]

where $N$ is the number of people, $X_n$ is the observed response, and $P_n$ is the response predicted from the logit difficulty of the item and the logit ability of the person (Smith,

1991b). In the Rasch model, the probability of a correct response $X_{vi}$ by person $v$ with ability $\beta_v$ to item $i$ with difficulty $\delta_i$ can be found as

\[ P\{X_{vi} = 1 \mid \beta_v, \delta_i\} = \frac{\exp(\beta_v - \delta_i)}{1 + \exp(\beta_v - \delta_i)} \]

(Wright & Stone, 1979), or

\[ p_{ij} = \frac{\exp(b_j - d_i)}{1 + \exp(b_j - d_i)}, \]

where $b_j$ is the ability measure for persons in score group $j$ (Smith, 1991b, p. 153). Although these formulas appear to be considerably different, they yield the same results provided the person's ability measure is the same. The standard deviation of the unweighted total fit mean squares can be found as

\[ S[MS(UT)_i] = \frac{\left[ \sum_{n=1}^{N} \frac{1}{W_{ni}} - 4N \right]^{1/2}}{N}. \]

The weighted total fit mean square ($WMS_i$) can be calculated as

\[ WMS_i = \frac{\sum_{n=1}^{N} (X_n - P_n)^2}{\sum_{n=1}^{N} W_n}, \]

where $W$, the weighting function, can be calculated as

\[ W = P(1 - P), \]

with a standard deviation of

\[ S[MS(WT)_i] = \frac{\left[ \sum_{n=1}^{N} W_{ni} - 4 \sum_{n=1}^{N} W_{ni}^2 \right]^{1/2}}{\sum_{n=1}^{N} W_{ni}} \]

(Smith, 1991b). The unweighted between fit mean square ($UBMS_i$) is defined as

\[ UBMS_i = \frac{1}{J - 1} \sum_{j=1}^{J} \frac{\left[ \sum_{n \in j} X_n - \sum_{n \in j} P_n \right]^2}{\sum_{n \in j} P_n (1 - P_n)}, \]

where $J$ is the number of score groups, the inner sums run over the $N_j$ persons in score group $j$, $X_n$ is the observed response for person $n$, and $P_n$ is the predicted response for person $n$ (Smith, 1991b, p. 32). The unweighted between fit standard deviation can be approximated by

\[ S[UBMS_i] \approx \left[ \frac{2}{J - 1} \right]^{1/2}. \]

Once calculated, the mean squares are converted to unit normal statistics by the following cube root transformation formula:

\[ t = \left( V^{1/3} - 1 \right) \left( \frac{3}{S} \right) + \frac{S}{3}, \]

where $V$, the mean square, and $S$, the standard deviation, are the values associated with the mean square under consideration (Smith, 1991b; Wright & Masters, 1982). The resulting statistics have an expected mean of 0 and a standard deviation of 1.
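The total fit formulas above can be illustrated with a short numerical sketch. The following is only an illustration under stated assumptions, not the IPARM implementation: the function names (`rasch_prob`, `cube_root_t`, `item_fit`) and the tiny data set are hypothetical, abilities and the item difficulty are treated as known rather than estimated, and NumPy is assumed to be available.

```python
import numpy as np


def rasch_prob(b, d):
    """Rasch probability of a correct response for abilities b on an item of difficulty d (logits)."""
    b = np.asarray(b, dtype=float)
    return 1.0 / (1.0 + np.exp(-(b - d)))


def cube_root_t(v, s):
    """Cube root transformation of a mean square v with standard deviation s."""
    return (np.cbrt(v) - 1.0) * (3.0 / s) + s / 3.0


def item_fit(x, b, d):
    """Unweighted and weighted total fit mean squares for one item, with transformed values.

    x: observed 0/1 responses of N persons to the item (hypothetical data)
    b: person abilities in logits, assumed known here
    d: item difficulty in logits, assumed known here
    """
    x = np.asarray(x, dtype=float)
    p = rasch_prob(b, d)
    w = p * (1.0 - p)            # weighting function W = P(1 - P)
    sq = (x - p) ** 2            # squared raw residuals
    n = x.size

    ums = np.sum(sq / w) / n     # unweighted mean square
    wms = np.sum(sq) / np.sum(w) # weighted mean square

    # Standard deviations of the two mean squares (dichotomous case)
    s_u = np.sqrt(np.sum(1.0 / w) - 4.0 * n) / n
    s_w = np.sqrt(np.sum(w) - 4.0 * np.sum(w ** 2)) / np.sum(w)

    return {
        "ums": ums,
        "wms": wms,
        "t_unweighted": cube_root_t(ums, s_u),
        "t_weighted": cube_root_t(wms, s_w),
    }


# Two persons at -1 and +1 logits on an item of difficulty 0.
expected_pattern = item_fit([0, 1], [-1.0, 1.0], 0.0)    # low fails, high succeeds
unexpected_pattern = item_fit([1, 0], [-1.0, 1.0], 0.0)  # low succeeds, high fails
print(expected_pattern)
print(unexpected_pattern)
```

In this toy case the reversed pattern, which resembles a lucky guess by the low-ability person, yields mean squares above 1 and positive transformed values, while the expected pattern yields mean squares below 1. Note that the standard deviations are zero when every person's ability equals the item difficulty, so the transformation is undefined in that degenerate case.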

Logit Residual Index

The Logit Residual Index (LRI) provides a reference as to the flatness or steepness of the item characteristic curve (ICC). It indicates the linear trend of the residuals summed over persons for each item. The LRI can be calculated as

\[ LRI_i = \frac{\sum_{n=1}^{N} (Y_{ni} - \bar{Y}_{.i})(b_n - d_i)}{\sum_{n=1}^{N} (b_n - d_i)^2}, \]

where $d_i$ is the difficulty of the item, $b_n$ is the ability of the person, $N$ is the number of persons, $X_{ni}$ is the observed response, $P_{ni}$ is the predicted response, and $Y_{ni}$ is the standardized residual, with mean $\bar{Y}_{.i}$ over persons:

\[ Y_{ni} = \frac{X_{ni} - P_{ni}}{\sqrt{P_{ni}(1 - P_{ni})}}, \qquad \bar{Y}_{.i} = \frac{1}{N} \sum_{n=1}^{N} Y_{ni} \]

(Smith, 1991b, p. 30). As one of the output variables from IPARM, the LRI is a measure of how far an item deviates from the common slope that is fitted for all items. The index has an expected value of zero (Smith, 1991b). That is, an item with an ICC that fits the modeled common curve will have an LRI value of zero. Therefore, an item with an LRI value greater than zero will have an ICC that is steeper than the modeled curve, and items with an LRI value less than zero will have an ICC that is flatter than the modeled curve. (Note: In the traditional classical true score approach, the point biserial correlation is used to provide an estimate of the slope of the observed item characteristic curve. However,

the point biserial correlation has been found to be sample specific, and no discrete values have been established to provide an accurate interpretation of the correlation coefficient as related to the slope of the item characteristic curve). IPARM automatically assigns group membership (ability groups) based on the performance of each person on each item. The program attempts to place an equal number of persons in each ability group. Negative LRI values indicate that low-ability groups should have positive residuals and high-ability groups should have negative residuals. This indicates that low-ability persons performed better than expected and high-ability persons performed less well than expected. Positive LRI values indicate that low-ability groups should have negative residuals and high-ability groups should have positive residuals. This indicates that high-ability persons performed better than expected and low-ability persons performed less well than expected. Items with negative fit statistics tend to have steeper observed ICCs than predicted, indicating an overfit to the model, and items with positive fit statistics tend to have flatter observed ICCs than predicted, indicating an underfit to the model.

Purpose of Study

Smith (1991b) proposed that fit statistics in Rasch calibration programs provide a frame of reference for judging performance and that one way of establishing this frame of reference is to simulate data that fit the Rasch model over a variety of test conditions. The present Monte Carlo study investigated the effects of varying item difficulty distributions, number of persons, number of items, and levels of
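As a rough illustration of the Logit Residual Index described in this section, the sketch below computes the index for a single item as the slope of the standardized residuals on the ability-difficulty offset. It is a minimal sketch, not the IPARM routine: the function name `logit_residual_index` and the two-person data set are hypothetical, abilities and difficulty are assumed known, and NumPy is assumed to be available.

```python
import numpy as np


def logit_residual_index(x, b, d):
    """Logit Residual Index for one item (illustrative sketch).

    x: observed 0/1 responses of N persons to the item (hypothetical data)
    b: person abilities in logits, assumed known here
    d: item difficulty in logits, assumed known here
    """
    x = np.asarray(x, dtype=float)
    b = np.asarray(b, dtype=float)
    p = 1.0 / (1.0 + np.exp(-(b - d)))     # Rasch predicted response
    y = (x - p) / np.sqrt(p * (1.0 - p))   # standardized residuals
    offset = b - d                         # ability minus difficulty
    return np.sum((y - y.mean()) * offset) / np.sum(offset ** 2)


# Low-ability person fails and high-ability person succeeds: positive index.
steeper = logit_residual_index([0, 1], [-1.0, 1.0], 0.0)
# Low-ability person succeeds and high-ability person fails: negative index.
flatter = logit_residual_index([1, 0], [-1.0, 1.0], 0.0)
print(steeper, flatter)
```

The signs agree with the interpretation given in the text: positive residuals for high-ability persons and negative residuals for low-ability persons produce a positive index, and the reverse pattern produces a negative index.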


More information

Description of components in tailored testing

Description of components in tailored testing Behavior Research Methods & Instrumentation 1977. Vol. 9 (2).153-157 Description of components in tailored testing WAYNE M. PATIENCE University ofmissouri, Columbia, Missouri 65201 The major purpose of

More information

Chapter 2--Norms and Basic Statistics for Testing

Chapter 2--Norms and Basic Statistics for Testing Chapter 2--Norms and Basic Statistics for Testing Student: 1. Statistical procedures that summarize and describe a series of observations are called A. inferential statistics. B. descriptive statistics.

More information

Empirical Knowledge: based on observations. Answer questions why, whom, how, and when.

Empirical Knowledge: based on observations. Answer questions why, whom, how, and when. INTRO TO RESEARCH METHODS: Empirical Knowledge: based on observations. Answer questions why, whom, how, and when. Experimental research: treatments are given for the purpose of research. Experimental group

More information

Using Differential Item Functioning to Test for Inter-rater Reliability in Constructed Response Items

Using Differential Item Functioning to Test for Inter-rater Reliability in Constructed Response Items University of Wisconsin Milwaukee UWM Digital Commons Theses and Dissertations May 215 Using Differential Item Functioning to Test for Inter-rater Reliability in Constructed Response Items Tamara Beth

More information

CHAPTER 3 DATA ANALYSIS: DESCRIBING DATA

CHAPTER 3 DATA ANALYSIS: DESCRIBING DATA Data Analysis: Describing Data CHAPTER 3 DATA ANALYSIS: DESCRIBING DATA In the analysis process, the researcher tries to evaluate the data collected both from written documents and from other sources such

More information

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison Empowered by Psychometrics The Fundamentals of Psychometrics Jim Wollack University of Wisconsin Madison Psycho-what? Psychometrics is the field of study concerned with the measurement of mental and psychological

More information

Development, Standardization and Application of

Development, Standardization and Application of American Journal of Educational Research, 2018, Vol. 6, No. 3, 238-257 Available online at http://pubs.sciepub.com/education/6/3/11 Science and Education Publishing DOI:10.12691/education-6-3-11 Development,

More information

Modeling DIF with the Rasch Model: The Unfortunate Combination of Mean Ability Differences and Guessing

Modeling DIF with the Rasch Model: The Unfortunate Combination of Mean Ability Differences and Guessing James Madison University JMU Scholarly Commons Department of Graduate Psychology - Faculty Scholarship Department of Graduate Psychology 4-2014 Modeling DIF with the Rasch Model: The Unfortunate Combination

More information

Chapter 1: Exploring Data

Chapter 1: Exploring Data Chapter 1: Exploring Data Key Vocabulary:! individual! variable! frequency table! relative frequency table! distribution! pie chart! bar graph! two-way table! marginal distributions! conditional distributions!

More information

The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests

The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests Mary E. Lunz and Betty A. Bergstrom, American Society of Clinical Pathologists Benjamin D. Wright, University

More information

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1 Nested Factor Analytic Model Comparison as a Means to Detect Aberrant Response Patterns John M. Clark III Pearson Author Note John M. Clark III,

More information

Section 5. Field Test Analyses

Section 5. Field Test Analyses Section 5. Field Test Analyses Following the receipt of the final scored file from Measurement Incorporated (MI), the field test analyses were completed. The analysis of the field test data can be broken

More information

Citation for published version (APA): Ebbes, P. (2004). Latent instrumental variables: a new approach to solve for endogeneity s.n.

Citation for published version (APA): Ebbes, P. (2004). Latent instrumental variables: a new approach to solve for endogeneity s.n. University of Groningen Latent instrumental variables Ebbes, P. IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document

More information

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments Greg Pope, Analytics and Psychometrics Manager 2008 Users Conference San Antonio Introduction and purpose of this session

More information

THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH

THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH By JANN MARIE WISE MACINNES A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF

More information

CHAPTER 3 RESEARCH METHODOLOGY

CHAPTER 3 RESEARCH METHODOLOGY CHAPTER 3 RESEARCH METHODOLOGY 3.1 Introduction 3.1 Methodology 3.1.1 Research Design 3.1. Research Framework Design 3.1.3 Research Instrument 3.1.4 Validity of Questionnaire 3.1.5 Statistical Measurement

More information

Chapter 5: Field experimental designs in agriculture

Chapter 5: Field experimental designs in agriculture Chapter 5: Field experimental designs in agriculture Jose Crossa Biometrics and Statistics Unit Crop Research Informatics Lab (CRIL) CIMMYT. Int. Apdo. Postal 6-641, 06600 Mexico, DF, Mexico Introduction

More information

Appendix B Statistical Methods

Appendix B Statistical Methods Appendix B Statistical Methods Figure B. Graphing data. (a) The raw data are tallied into a frequency distribution. (b) The same data are portrayed in a bar graph called a histogram. (c) A frequency polygon

More information

PRINCIPLES OF STATISTICS

PRINCIPLES OF STATISTICS PRINCIPLES OF STATISTICS STA-201-TE This TECEP is an introduction to descriptive and inferential statistics. Topics include: measures of central tendency, variability, correlation, regression, hypothesis

More information

Measurement and Descriptive Statistics. Katie Rommel-Esham Education 604

Measurement and Descriptive Statistics. Katie Rommel-Esham Education 604 Measurement and Descriptive Statistics Katie Rommel-Esham Education 604 Frequency Distributions Frequency table # grad courses taken f 3 or fewer 5 4-6 3 7-9 2 10 or more 4 Pictorial Representations Frequency

More information

accuracy (see, e.g., Mislevy & Stocking, 1989; Qualls & Ansley, 1985; Yen, 1987). A general finding of this research is that MML and Bayesian

accuracy (see, e.g., Mislevy & Stocking, 1989; Qualls & Ansley, 1985; Yen, 1987). A general finding of this research is that MML and Bayesian Recovery of Marginal Maximum Likelihood Estimates in the Two-Parameter Logistic Response Model: An Evaluation of MULTILOG Clement A. Stone University of Pittsburgh Marginal maximum likelihood (MML) estimation

More information

APPENDIX N. Summary Statistics: The "Big 5" Statistical Tools for School Counselors

APPENDIX N. Summary Statistics: The Big 5 Statistical Tools for School Counselors APPENDIX N Summary Statistics: The "Big 5" Statistical Tools for School Counselors This appendix describes five basic statistical tools school counselors may use in conducting results based evaluation.

More information

A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model

A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model Gary Skaggs Fairfax County, Virginia Public Schools José Stevenson

More information

Descriptive Statistics Lecture

Descriptive Statistics Lecture Definitions: Lecture Psychology 280 Orange Coast College 2/1/2006 Statistics have been defined as a collection of methods for planning experiments, obtaining data, and then analyzing, interpreting and

More information

Six Sigma Glossary Lean 6 Society

Six Sigma Glossary Lean 6 Society Six Sigma Glossary Lean 6 Society ABSCISSA ACCEPTANCE REGION ALPHA RISK ALTERNATIVE HYPOTHESIS ASSIGNABLE CAUSE ASSIGNABLE VARIATIONS The horizontal axis of a graph The region of values for which the null

More information

A Modified CATSIB Procedure for Detecting Differential Item Function. on Computer-Based Tests. Johnson Ching-hong Li 1. Mark J. Gierl 1.

A Modified CATSIB Procedure for Detecting Differential Item Function. on Computer-Based Tests. Johnson Ching-hong Li 1. Mark J. Gierl 1. Running Head: A MODIFIED CATSIB PROCEDURE FOR DETECTING DIF ITEMS 1 A Modified CATSIB Procedure for Detecting Differential Item Function on Computer-Based Tests Johnson Ching-hong Li 1 Mark J. Gierl 1

More information

A BAYESIAN SOLUTION FOR THE LAW OF CATEGORICAL JUDGMENT WITH CATEGORY BOUNDARY VARIABILITY AND EXAMINATION OF ROBUSTNESS TO MODEL VIOLATIONS

A BAYESIAN SOLUTION FOR THE LAW OF CATEGORICAL JUDGMENT WITH CATEGORY BOUNDARY VARIABILITY AND EXAMINATION OF ROBUSTNESS TO MODEL VIOLATIONS A BAYESIAN SOLUTION FOR THE LAW OF CATEGORICAL JUDGMENT WITH CATEGORY BOUNDARY VARIABILITY AND EXAMINATION OF ROBUSTNESS TO MODEL VIOLATIONS A Thesis Presented to The Academic Faculty by David R. King

More information

Detection of Differential Test Functioning (DTF) and Differential Item Functioning (DIF) in MCCQE Part II Using Logistic Models

Detection of Differential Test Functioning (DTF) and Differential Item Functioning (DIF) in MCCQE Part II Using Logistic Models Detection of Differential Test Functioning (DTF) and Differential Item Functioning (DIF) in MCCQE Part II Using Logistic Models Jin Gong University of Iowa June, 2012 1 Background The Medical Council of

More information

Choosing the Correct Statistical Test

Choosing the Correct Statistical Test Choosing the Correct Statistical Test T racie O. Afifi, PhD Departments of Community Health Sciences & Psychiatry University of Manitoba Department of Community Health Sciences COLLEGE OF MEDICINE, FACULTY

More information

THE APPLICATION OF ORDINAL LOGISTIC HEIRARCHICAL LINEAR MODELING IN ITEM RESPONSE THEORY FOR THE PURPOSES OF DIFFERENTIAL ITEM FUNCTIONING DETECTION

THE APPLICATION OF ORDINAL LOGISTIC HEIRARCHICAL LINEAR MODELING IN ITEM RESPONSE THEORY FOR THE PURPOSES OF DIFFERENTIAL ITEM FUNCTIONING DETECTION THE APPLICATION OF ORDINAL LOGISTIC HEIRARCHICAL LINEAR MODELING IN ITEM RESPONSE THEORY FOR THE PURPOSES OF DIFFERENTIAL ITEM FUNCTIONING DETECTION Timothy Olsen HLM II Dr. Gagne ABSTRACT Recent advances

More information

Biostatistics. Donna Kritz-Silverstein, Ph.D. Professor Department of Family & Preventive Medicine University of California, San Diego

Biostatistics. Donna Kritz-Silverstein, Ph.D. Professor Department of Family & Preventive Medicine University of California, San Diego Biostatistics Donna Kritz-Silverstein, Ph.D. Professor Department of Family & Preventive Medicine University of California, San Diego (858) 534-1818 dsilverstein@ucsd.edu Introduction Overview of statistical

More information

1 The conceptual underpinnings of statistical power

1 The conceptual underpinnings of statistical power 1 The conceptual underpinnings of statistical power The importance of statistical power As currently practiced in the social and health sciences, inferential statistics rest solidly upon two pillars: statistical

More information

Statistical Techniques. Masoud Mansoury and Anas Abulfaraj

Statistical Techniques. Masoud Mansoury and Anas Abulfaraj Statistical Techniques Masoud Mansoury and Anas Abulfaraj What is Statistics? https://www.youtube.com/watch?v=lmmzj7599pw The definition of Statistics The practice or science of collecting and analyzing

More information

Statistical Methods and Reasoning for the Clinical Sciences

Statistical Methods and Reasoning for the Clinical Sciences Statistical Methods and Reasoning for the Clinical Sciences Evidence-Based Practice Eiki B. Satake, PhD Contents Preface Introduction to Evidence-Based Statistics: Philosophical Foundation and Preliminaries

More information

Basic concepts and principles of classical test theory

Basic concepts and principles of classical test theory Basic concepts and principles of classical test theory Jan-Eric Gustafsson What is measurement? Assignment of numbers to aspects of individuals according to some rule. The aspect which is measured must

More information

Results & Statistics: Description and Correlation. I. Scales of Measurement A Review

Results & Statistics: Description and Correlation. I. Scales of Measurement A Review Results & Statistics: Description and Correlation The description and presentation of results involves a number of topics. These include scales of measurement, descriptive statistics used to summarize

More information

02a: Test-Retest and Parallel Forms Reliability

02a: Test-Retest and Parallel Forms Reliability 1 02a: Test-Retest and Parallel Forms Reliability Quantitative Variables 1. Classic Test Theory (CTT) 2. Correlation for Test-retest (or Parallel Forms): Stability and Equivalence for Quantitative Measures

More information

Ecological Statistics

Ecological Statistics A Primer of Ecological Statistics Second Edition Nicholas J. Gotelli University of Vermont Aaron M. Ellison Harvard Forest Sinauer Associates, Inc. Publishers Sunderland, Massachusetts U.S.A. Brief Contents

More information

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods James Madison University JMU Scholarly Commons Department of Graduate Psychology - Faculty Scholarship Department of Graduate Psychology 3-008 Scoring Multiple Choice Items: A Comparison of IRT and Classical

More information

Psychology Research Process

Psychology Research Process Psychology Research Process Logical Processes Induction Observation/Association/Using Correlation Trying to assess, through observation of a large group/sample, what is associated with what? Examples:

More information

CHAPTER VI RESEARCH METHODOLOGY

CHAPTER VI RESEARCH METHODOLOGY CHAPTER VI RESEARCH METHODOLOGY 6.1 Research Design Research is an organized, systematic, data based, critical, objective, scientific inquiry or investigation into a specific problem, undertaken with the

More information

The degree to which a measure is free from error. (See page 65) Accuracy

The degree to which a measure is free from error. (See page 65) Accuracy Accuracy The degree to which a measure is free from error. (See page 65) Case studies A descriptive research method that involves the intensive examination of unusual people or organizations. (See page

More information

Chapter 6 Topic 6B Test Bias and Other Controversies. The Question of Test Bias

Chapter 6 Topic 6B Test Bias and Other Controversies. The Question of Test Bias Chapter 6 Topic 6B Test Bias and Other Controversies The Question of Test Bias Test bias is an objective, empirical question, not a matter of personal judgment. Test bias is a technical concept of amenable

More information

A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests

A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests David Shin Pearson Educational Measurement May 007 rr0701 Using assessment and research to promote learning Pearson Educational

More information

Bayesian Tailored Testing and the Influence

Bayesian Tailored Testing and the Influence Bayesian Tailored Testing and the Influence of Item Bank Characteristics Carl J. Jensema Gallaudet College Owen s (1969) Bayesian tailored testing method is introduced along with a brief review of its

More information

INTRODUCTION TO STATISTICS SORANA D. BOLBOACĂ

INTRODUCTION TO STATISTICS SORANA D. BOLBOACĂ INTRODUCTION TO STATISTICS SORANA D. BOLBOACĂ OBJECTIVES Definitions Stages of Scientific Knowledge Quantification and Accuracy Types of Medical Data Population and sample Sampling methods DEFINITIONS

More information

Shiken: JALT Testing & Evaluation SIG Newsletter. 12 (2). April 2008 (p )

Shiken: JALT Testing & Evaluation SIG Newsletter. 12 (2). April 2008 (p ) Rasch Measurementt iin Language Educattiion Partt 2:: Measurementt Scalles and Invariiance by James Sick, Ed.D. (J. F. Oberlin University, Tokyo) Part 1 of this series presented an overview of Rasch measurement

More information

Influences of IRT Item Attributes on Angoff Rater Judgments

Influences of IRT Item Attributes on Angoff Rater Judgments Influences of IRT Item Attributes on Angoff Rater Judgments Christian Jones, M.A. CPS Human Resource Services Greg Hurt!, Ph.D. CSUS, Sacramento Angoff Method Assemble a panel of subject matter experts

More information

Psychology of Perception Psychology 4165, Fall 2001 Laboratory 1 Weight Discrimination

Psychology of Perception Psychology 4165, Fall 2001 Laboratory 1 Weight Discrimination Psychology 4165, Laboratory 1 Weight Discrimination Weight Discrimination Performance Probability of "Heavier" Response 1.0 0.8 0.6 0.4 0.2 0.0 50.0 100.0 150.0 200.0 250.0 Weight of Test Stimulus (grams)

More information

Analysis and Interpretation of Data Part 1

Analysis and Interpretation of Data Part 1 Analysis and Interpretation of Data Part 1 DATA ANALYSIS: PRELIMINARY STEPS 1. Editing Field Edit Completeness Legibility Comprehensibility Consistency Uniformity Central Office Edit 2. Coding Specifying

More information

WDHS Curriculum Map Probability and Statistics. What is Statistics and how does it relate to you?

WDHS Curriculum Map Probability and Statistics. What is Statistics and how does it relate to you? WDHS Curriculum Map Probability and Statistics Time Interval/ Unit 1: Introduction to Statistics 1.1-1.3 2 weeks S-IC-1: Understand statistics as a process for making inferences about population parameters

More information

Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University

Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure Rob Cavanagh Len Sparrow Curtin University R.Cavanagh@curtin.edu.au Abstract The study sought to measure mathematics anxiety

More information

Adaptive EAP Estimation of Ability

Adaptive EAP Estimation of Ability Adaptive EAP Estimation of Ability in a Microcomputer Environment R. Darrell Bock University of Chicago Robert J. Mislevy National Opinion Research Center Expected a posteriori (EAP) estimation of ability,

More information

Investigating the robustness of the nonparametric Levene test with more than two groups

Investigating the robustness of the nonparametric Levene test with more than two groups Psicológica (2014), 35, 361-383. Investigating the robustness of the nonparametric Levene test with more than two groups David W. Nordstokke * and S. Mitchell Colp University of Calgary, Canada Testing

More information

Instrument equivalence across ethnic groups. Antonio Olmos (MHCD) Susan R. Hutchinson (UNC)

Instrument equivalence across ethnic groups. Antonio Olmos (MHCD) Susan R. Hutchinson (UNC) Instrument equivalence across ethnic groups Antonio Olmos (MHCD) Susan R. Hutchinson (UNC) Overview Instrument Equivalence Measurement Invariance Invariance in Reliability Scores Factorial Invariance Item

More information

MBA 605 Business Analytics Don Conant, PhD. GETTING TO THE STANDARD NORMAL DISTRIBUTION

MBA 605 Business Analytics Don Conant, PhD. GETTING TO THE STANDARD NORMAL DISTRIBUTION MBA 605 Business Analytics Don Conant, PhD. GETTING TO THE STANDARD NORMAL DISTRIBUTION Variables In the social sciences data are the observed and/or measured characteristics of individuals and groups

More information

MEA DISCUSSION PAPERS

MEA DISCUSSION PAPERS Inference Problems under a Special Form of Heteroskedasticity Helmut Farbmacher, Heinrich Kögel 03-2015 MEA DISCUSSION PAPERS mea Amalienstr. 33_D-80799 Munich_Phone+49 89 38602-355_Fax +49 89 38602-390_www.mea.mpisoc.mpg.de

More information