MEASUREMENT DISTURBANCE EFFECTS ON RASCH FIT STATISTICS AND THE LOGIT RESIDUAL INDEX

DISSERTATION

Presented to the Graduate Council of the University of North Texas in Partial Fulfillment of the Requirements For the Degree of DOCTOR OF PHILOSOPHY

By Robert E. Mount, A.A., B.S., M.A., C.R.C.

Denton, Texas

August, 1997

Mount, Robert E., Measurement disturbance effects on Rasch fit statistics and the Logit Residual Index. Doctor of Philosophy (Educational Research), August, 1997, 194 pp., 15 tables, 2 illustrations, references, 32 titles.

The effects of random guessing as a measurement disturbance on Rasch fit statistics (unweighted, weighted, and unweighted ability between) and the Logit Residual Index (LRI) were examined through simulated data sets of varying sample sizes, test lengths, and distribution types. Three test lengths (25, 50, and 100), three sample sizes (25, 50, and 100), two item difficulty distributions (normal and uniform), and three levels of guessing (no guessing [0%], 25%, and 50%) were used in the simulations, resulting in 54 experimental conditions. The mean logit person ability for each experiment was +1. Each experimental condition was simulated once in an effort to approximate what could happen on the single administration of a four option per item multiple choice test to a group of relatively high ability persons.

Previous research has shown that varying item and person parameters have no effect on Rasch fit statistics. Consequently, these parameters were used in the present study to establish realistic test conditions, but were not interpreted as effect factors in determining the results of this study.

Rasch fit statistics were found to be robust to varying levels of guessing and to distribution types. The unweighted fit statistic was more sensitive to problems far away from the average ability of the group in which the problems occurred (problems away from the item difficulties). The weighted fit statistic was more

sensitive to problems centered on the item difficulties. It was also found that, as the probability of guessing the correct answer increased, low-ability persons tended consistently to guess the correct answer, inducing bias (item familiarity) into the tests. These items were detected by the unweighted between fit statistic. In conditions involving minor problems and misfitting items, the LRI was able to identify group membership of the persons in which the problem occurred. Therefore, it is necessary to use the unweighted, weighted, and between fit statistics in combination with the LRI to diagnose problems for a more accurate assessment of individual differences.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF ILLUSTRATIONS

Chapter

1. INTRODUCTION
   Overview
   Properties of Estimators
   Rasch Estimation Methods
   Rationale for the Study
   Research Question
   Definition of Terms
   Delimitations

2. REVIEW OF THE LITERATURE
   Historical Perspective
   Measurement Disturbances
   Rasch Fit Statistics
   Logit Residual Index
   Purpose of Study

3. METHODS AND PROCEDURES
   Data Set Construction
   Simulated Data Sets
   Rasch Analysis
   Statistical Analysis

4. RESULTS
   Effect of Guessing on Rasch Fit Statistics
   Simulation Design Effects
   Detection of Guessing by Rasch Fit Statistics
   Guessing and the Logit Residual Index

5. FINDINGS AND CONCLUSIONS
   Effect of Guessing on Rasch Fit Statistics
   Detection of Guessing by Rasch Fit Statistics
   Guessing and the Logit Residual Index
   Summary
   Conclusions
   Further Study
   Recommendations

APPENDIX
   A. IPARM Control File Parameters
   B. A Summary of Item Fit Information by Experiment
   C. A Summary of Misfitting Item Statistics by Experiment

REFERENCES

LIST OF TABLES

Table

1. Definition of Experiments
2. Mean Summary of Item Fit Information for Experiments 1-9 (Normally Distributed Item Difficulty Distributions and No Guessing)
3. Mean Summary of Item Fit Information for Experiments (Normally Distributed Item Difficulty Distributions and a 25% Chance of Guessing Correctly)
4. Summary of Mean Item Fit Information for Experiments (Normally Distributed Item Difficulty Distributions and a 50% Chance of Guessing Correctly)
5. Mean Differences for Experiments With Normally Distributed Item Difficulties With No Guessing, 25%, and a 50% Chance of Guessing Correctly
6. Summary of Mean Item Fit Information for Experiments (Uniformly Distributed Item Difficulty Distributions and No Guessing)
7. Summary of Mean Item Fit Information for Experiments (Uniformly Distributed Item Difficulty Distributions and a 25% Chance of Guessing Correctly)
8. Summary of Mean Item Fit Information for Experiments (Uniformly Distributed Item Difficulty Distributions and a 50% Chance of Guessing Correctly)
9. Mean Differences for Experiments With Uniformly Distributed Item Difficulties With No Guessing, 25%, and a 50% Chance of Guessing Correctly
10. Mean Differences for Experiments With Normal and Uniformly Distributed Item Difficulties and Varying Levels of Guessing (No Guessing, 25%, and 50%)
11. A Comparison of the Frequency of Misfitting Items Detected by Rasch Fit Statistics in a Normal Distribution of Item Difficulties at Varying Test Lengths and Levels of Guessing Using $\chi^2$
12. A Comparison of the Frequency of Misfitting Items Detected by Rasch Fit Statistics in a Uniform Distribution of Item Difficulties at Varying Test Lengths and Levels of Guessing Using $\chi^2$
13. A Comparison of the Frequency of Misfitting Items Detected by Rasch Fit Statistics in Normal and Uniformly Distributed Item Difficulties Using $\chi^2$
14. Number and Percent of Misfitting Items Detected by Rasch Fit Statistics Across Experiments Involving Normal and Uniformly Distributed Item Difficulties
15. Number of Misfitting Items and Number and Percent of LRI Values by Experimental Conditions

LIST OF ILLUSTRATIONS

Figure

1. Rasch Model Notations
2. Table Format for the Display of Experimental Data

CHAPTER 1

INTRODUCTION

Overview

Stevens (1946) defined measurement as the assignment of numerals to objects or events according to rules. However, the measurement of individual differences is not as straightforward as the definition implies; it is fraught with several measurement problems: (a) no one approach is universally acceptable; (b) measurements are based on limited samples of behaviors; (c) measurements are subject to error; and (d) the units of measurement are not well defined (Crocker & Algina, 1986). Of the problems associated with the measurement of individual differences, the accuracy of measurement is probably the most important. A measure can only be as accurate as the ruler (scale) used to obtain the measurement.

Traditionally, the measurement of individual differences was based on classical true score theory and its associated statistics and scales of measurement (nominal, ordinal, interval, and ratio). The irony is that measurements obtained using this approach were found to be sample specific and that generalizations beyond the reference sample should be made with caution. In addition, the scales of measurement were found to have arbitrary zero points and unequal measurement intervals. Therefore, measurements obtained using these scales may be as arbitrary and unequal as the ruler

used to make them. Given these characteristics, estimates (statistics) obtained in identifying individual differences tended to be biased, inconsistent, insufficient, and inefficient when used with nonnormal distributions.

What was needed was a true linear scale that had an absolute zero point and equal interval units of measurement. In 1960, one such scale, with associated statistics, was developed by Rasch (Rasch, 1980). Known as Rasch analysis, this approach allowed for the independent estimation of person ability and item difficulty parameters. In addition, the statistics used in this approach were found to be consistent, efficient, sufficient, and unbiased.

In the assessment of item function in the Rasch model, one observes the residual trends among ability groups. The Logit Residual Index (LRI), a statistic introduced by Smith (1991b), is a measure of how far an item deviates from the common slope that is fitted for all items (model curve) and the residual trend among ability groups. Thus, items with LRI values greater than zero will have an item characteristic curve (ICC) that is steeper than the modeled curve, and items with values less than zero will have an ICC that is flatter than the modeled curve. The residual trend is an indication of how well, or less well, different ability groups performed on an individual item.

The purpose of this investigation is to test the effects of varying test parameters and levels of measurement disturbance on Rasch fit statistics and the Logit Residual Index.

Properties of Estimators

Rarely, if ever, are characteristics about a complete population known; therefore, researchers make inferences about a population based on a representative random sample taken from the population. A population refers to all members in the entire group having some common characteristic. For example, a population may include all members in a classroom, school, city, community, county, state, nation, or the world. As group membership increases, it becomes increasingly difficult to obtain measures on all characteristics, either because of restrictions in time and cost or because the population grows too rapidly. A population may be finite (a known or countable number of members) or infinite (a population so large that group membership is not known). Characteristics about a population are called parameters, and characteristics about a sample are called statistics.

The most commonly used estimates about populations are measures of central tendency (mean, median, and mode). They identify the most typical measures in a normal distribution. The arithmetic mean is the most commonly used measure of central tendency. It is a simple arithmetic average determined by summing all scores in a distribution of scores and dividing by the number of scores. The mean is an appropriate measure of central tendency when the score distribution is normal and the level of measurement is on the interval or ratio scale. The median is the midpoint in a distribution of scores when arranged in order of magnitude. Stated differently, it is the point on the score scale below which 50% of the scores fall. It is also equivalent to the 50th percentile. The median is

an appropriate measure of central tendency when the level of measurement is on the ordinal scale and the score distributions are other than normal. The mode is the most frequently occurring score in a distribution of scores. However, it is not a dependable measure of central location. Depending on the shape of the distribution, it is possible to have two modes (bimodal) or more (multimodal). The mode is an appropriate measure of central tendency when the level of measurement is on the nominal scale. In a normally distributed population of scores, the mean, median, and mode will coincide. However, in nonnormal distributions, these values differ, and, therefore, certain estimators have more desirable properties than others.

The desirable properties of estimators are consistency, efficiency, sufficiency, and unbiasedness. These serve as criteria for determining preferences for one method of estimation over another. An estimator is considered to be unbiased when the mean of a sampling distribution of means approaches that of the population parameter as the number of samples of a given size increases. That is, a statistic is unbiased when it shows no systematic tendency to be either greater than or less than the population parameter. For example, it can be shown that the variance $S^2 = \sum (X - \bar{X})^2 / n$ is a biased estimate of the population variance ($\sigma^2$) (Ferguson, 1981).
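The bias of the divide-by-n variance can be checked by brute force. The following is a minimal sketch, not part of the original study: it enumerates every ordered sample of size 2 drawn with replacement from a hypothetical three-value population and averages the two variance estimates. The population values and the function name are illustrative assumptions.

```python
from itertools import product
from statistics import mean, pvariance


def mean_sample_variance(population, n, divisor_offset):
    """Average of sum((x - xbar)^2) / (n - divisor_offset) over every
    ordered sample of size n drawn with replacement from the population."""
    estimates = []
    for sample in product(population, repeat=n):
        xbar = sum(sample) / n
        sum_sq = sum((x - xbar) ** 2 for x in sample)
        estimates.append(sum_sq / (n - divisor_offset))
    return mean(estimates)


population = [1, 2, 3]            # hypothetical toy population
sigma2 = pvariance(population)    # population variance

# Dividing by n systematically underestimates sigma^2 ...
biased = mean_sample_variance(population, n=2, divisor_offset=0)
# ... while dividing by n - 1 recovers it on average.
unbiased = mean_sample_variance(population, n=2, divisor_offset=1)

print(sigma2, biased, unbiased)
```

Under these assumptions the divide-by-n estimate averages to (n - 1)/n times the population variance, which is the systematic shortfall the text describes.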

Consistency implies that an estimator more closely approximates a population parameter as sample size increases. The arithmetic mean is a prime example of a consistent estimator: it more closely approximates the population parameter as the sample size increases.

Efficiency is implied by sampling variance. It refers to the variability of estimates from sample to sample, or the degree of sampling error associated with the estimator. That is, if the sampling error is less than the sampling error associated with any other method of estimation, the estimate is considered to be efficient (Ferguson, 1981). More explicitly, the estimator that has the smallest standard error is more efficient (Walker & Lev, 1953).

An estimator is sufficient for estimating a population parameter if it exhausts all the information about the population parameter from sample data. For example, the mean is a sufficient estimator of the population mean ($\mu$), because, once the sample mean is known, any other statistic computed from the sample data (such as the median or mode) would provide no further information about the population mean (Neter, Wasserman, & Whitmore, 1978). Given a normal population, the mean provides an estimate of $\mu$ that satisfies all the desirable properties of a good estimator (consistent, efficient, sufficient, and unbiased) (Walker & Lev, 1953).

Rasch Estimation Methods

Measurement implies the determination of the quantity, quality, or some other characteristic of an object or attribute. It answers the questions concerning how many,

how often, and how much of a particular object or attribute exists. The process involves the assignment of units or numbers in a logical fashion along a dimension or scale. When an object or attribute is measured, it is assigned a specific position along a dimension or numerical scale. Traditionally, we have used four scales or levels of measurement: (a) nominal scale: numbers, names, or words are used to identify or label individuals or objects; (b) ordinal scale: numbers or words reflect the order of things; (c) interval scale: has equal units of measurement and an arbitrary zero point; and (d) ratio scale: has equal units of measurement and a true or absolute zero point. Each scale of measurement has its own rules and makes different assumptions about the measurement process. These scales are prevalent in the traditional classical true score approach to the measurement of human traits.

The Rasch measurement model, unlike the classical true score model, attempts to explain the effect of a person's ability on item performance. The Rasch model frees the estimation of a person's ability from the item difficulty, and the estimation of the item difficulty is freed from the person's ability. In short, the more able the person, the better the chances for success on an item, and the easier the item, the more likely a person is to solve it (Wright & Stone, 1979). It has been shown that no other mathematical model allows the estimation of person ability measures ($\beta_v$) and item difficulty calibrations ($\delta_i$) independent of each other (Anderson, 1973; Barndorff-Nielsen, 1978; Rasch, 1961; Wright & Stone, 1979). The logistic function (probability of a correct response) in the Rasch model,

$$P\{x_{vi} = 1 \mid \beta_v, \delta_i\} = \frac{\exp(\beta_v - \delta_i)}{1 + \exp(\beta_v - \delta_i)},$$

provides both linearity of scale and generality of measure (Wright & Stone, 1979). Rasch called this particular characteristic "specific objectivity." The symbols and associated definitions used in the Rasch model are presented in Figure 1.

Symbol        Definition
$\beta$       ability level
$\delta$      difficulty level
$r_v$         test score of person v
$L$           the number of items in the test
$H$           the average difficulty level of the test
$M$           mean person ability
$\omega^2$    the variance in item difficulties of the test
$i$           an item on the test
$P_i$         sample p-value of item i
$v$           person
$\beta_v$     ability level of person v
$r$           score on the test
$\delta_i$    individual item difficulty level
$x_{vi}$      person response

Figure 1. Rasch model notations.

Several Rasch measurement models have been identified: (a) rating scale, (b) Poisson, (c) binomial, (d) dichotomous, (e) partial credit, and (f) many faceted. However, for a measurement model to work, there must be some method of estimating its parameters. Rasch identified six estimation methods: (a) the LOG method, (b) the PAIR method, (c) the FCON method, (d) the UCON method, (e) the PROX method, and (f) the UFORM method. Of the six estimation methods, PROX is the only estimation procedure in which item and person parameters can be easily calculated by hand.
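To make the logistic function and the "calculated by hand" remark concrete, here is a minimal sketch. It evaluates the Rasch probability and the PROX expressions that this chapter attributes to Wright and Stone (1979). The function names are illustrative. The test height H, the item difficulty variance, the mean ability M, and the ability variance are passed in as known values rather than estimated from data, which is a simplifying assumption of the sketch.

```python
import math


def rasch_probability(beta_v, delta_i):
    """P(x_vi = 1) = exp(beta_v - delta_i) / (1 + exp(beta_v - delta_i))."""
    return 1.0 / (1.0 + math.exp(-(beta_v - delta_i)))


def prox_person_ability(r_v, L, H, omega2):
    """PROX ability b_v and its standard error for raw score r_v on L items.

    H is the average item difficulty (test height) and omega2 the variance
    of the item difficulties; both are assumed to be supplied by the caller.
    """
    expansion = math.sqrt(1.0 + omega2 / 2.89)
    b_v = H + expansion * math.log(r_v / (L - r_v))
    se = expansion * math.sqrt(L / (r_v * (L - r_v)))
    return b_v, se


def prox_item_difficulty(p_i, N, M, sigma2):
    """PROX difficulty d_i and its standard error for item p-value p_i.

    N is the number of persons; M and sigma2 are the mean and variance of
    person ability, assumed to be supplied by the caller.
    """
    expansion = math.sqrt(1.0 + sigma2 / 2.89)
    d_i = M + expansion * math.log((1.0 - p_i) / p_i)
    se = expansion * math.sqrt(1.0 / (N * p_i * (1.0 - p_i)))
    return d_i, se


# A person one logit above an item's difficulty succeeds about 73% of the time.
print(round(rasch_probability(1.0, 0.0), 3))
# A raw score of half the items places the person at the test height H.
print(prox_person_ability(r_v=10, L=20, H=0.0, omega2=1.0))
```

Perfect and zero scores (r_v = L or r_v = 0, and p_i = 1 or p_i = 0) have no finite logit, so the sketch does not accept them.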

PROX is a normal approximation estimation procedure that expresses item difficulty calibrations and person ability measures on a common linear scale. This measurement unit is called a logit. The procedure assumes that (a) person abilities ($\beta_v$) are more or less normally distributed with mean ($\mu$) and standard deviation ($\sigma$), and (b) item difficulties ($\delta_i$) are more or less normally distributed with average difficulty ($H$) and standard deviation ($\omega$). Consequently, the effects of the sample on item difficulty calibrations and that of test length on person ability measures can be summarized by means and standard deviations on the variable being measured (Wright & Masters, 1982). The PROX estimation procedure frees the scores from the effects of sample size and test length, then transforms them into a logit measure (Wright & Stone, 1979).

The PROX estimation of a person's ability can be found without iteration as

$$b_v = H + \left(1 + \omega^2/2.89\right)^{1/2} \ln\left[\frac{r_v}{L - r_v}\right],$$

with a standard error of

$$SE(b_v) = \left(1 + \omega^2/2.89\right)^{1/2} \left[\frac{L}{r_v (L - r_v)}\right]^{1/2},$$

a test height of

$$H = \sum_{i}^{L} d_i / L,$$

and a variance estimate of

$$\omega^2 = \left(\sum_{i}^{L} d_i^2 - L H^2\right) / (L - 1).$$

Item difficulty $d_i$ can be found as

$$d_i = M + \left(1 + \sigma^2/2.89\right)^{1/2} \ln\left[\frac{1 - P_i}{P_i}\right],$$

with a standard error of

$$SE(d_i) = \left(1 + \sigma^2/2.89\right)^{1/2} \left[\frac{1}{N P_i (1 - P_i)}\right]^{1/2}.$$

The Rasch model uses the logit function, $\ln[(1 - P_i)/P_i]$, to transform the item p-value into a linear equal interval scale. In the Rasch model, $P_i$ is calculated as $P_i = S_i/N$, where $S_i$ is the number of satisfactory responses (correct answers) and $N$ the number of persons.

The PROX estimation method is most appropriately used for calibrating new items, because item difficulties among a sample of new items tend to approximate a normal distribution, and a sample of persons tends to be normally distributed (Wright & Stone, 1979).

It has been found that Rasch estimation methods are unbiased, consistent, efficient, and sufficient (Anderson, 1973; Andrich, 1988; Haberman, 1977; Wright, 1977; Wright & Stone, 1979). Therefore, the Rasch estimation methods are preferred over those of the traditional classical true score model. In the traditional classical true score approach, a person's ability is based on a test score, usually expressed as items correct, a percentage of 100, or a percentile rank. The test score has been shown to

reflect ordinal and curvilinear characteristics which are not conducive to meaningful interpretation.

Rationale for Study

The measurement of individual differences based on the classical true score approach and its associated statistics has been found to be biased, inefficient, inconsistent, and insufficient, especially with nonnormal distributions. The most often used scales of measurement (nominal, ordinal, and interval) have arbitrary zero points and unequal measurement intervals, and the results are sample specific. In addition, a single error term (standard error of measurement) is used for all examinees (Allen & Yen, 1979). Consequently, there is no way to identify specific objectivity among items and persons (independence of person ability and item difficulties) using the classical true score approach.

Test data are useful only if there is some correspondence between the items on the test and the latent trait being measured. In addition, the data should fit the measurement model used in constructing the test. In the classical true score approach, chi-square ($\chi^2$) and the point biserial correlation are used as indexes of goodness of fit. Chi-square is used to test the difference between observed and expected events, and the point biserial correlation is sometimes used as an index of fit for items on a test. The problem with $\chi^2$ as a fit index is that there are different sampling distributions as the degrees of freedom change. With the point biserial correlation, it is unclear what magnitude of value is needed to establish an

acceptable fit, and it is sensitive to the score distribution of the sample. The Rasch model has overcome many of the problems associated with the classical true score approach:

1. A standard error of measurement is provided for each examinee and item.
2. The standard error of measurement can be tested for significance.
3. The measurement scale has an absolute zero point and equal interval units.
4. The item and person parameters are independent.
5. The parameter estimators are unbiased, consistent, efficient, and sufficient.

Given these advantages, the Rasch model provides a more accurate estimate of a person's ability than does the classical true score approach (Allen & Yen, 1979; Wright & Stone, 1979), and it allows for the independent diagnosis of measurement problems associated with items, persons, and item by person interactions.

Although several research studies have shown the effects of guessing on Rasch fit statistics, no studies have used the Logit Residual Index (LRI) in conjunction with fit statistics in helping to identify the effects of measurement disturbances. The purpose of this Monte Carlo simulation is to investigate the effects of varying test parameters and levels of measurement disturbance on Rasch fit statistics and the Logit Residual Index in the detection of misfitting items in the single administration of a four option per item multiple choice test as experienced in a classroom situation. Rasch fit statistics are applied to simulated data sets of varying sample sizes, test lengths, item difficulty distributions, and levels of measurement disturbance.
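One way such a Monte Carlo data set could be generated is sketched below. This is an illustration under stated assumptions, not the procedure used in the study: the guessing mechanism (a response that fails under the Rasch model is converted to a correct answer with probability `guess_prob`), the ability standard deviation of 1, the centring of item difficulties on 0, and the uniform range of -2 to +2 logits are all hypothetical choices. Only the mean ability of +1 logit and the design factors come from the text.

```python
import math
import random


def simulate_responses(n_persons, n_items, guess_prob,
                       distribution="normal", mean_ability=1.0, seed=0):
    """Return an n_persons x n_items matrix of 0/1 responses.

    Assumption: guessing is applied only to responses that the Rasch
    model scores as incorrect, each rescued with probability guess_prob.
    """
    rng = random.Random(seed)
    abilities = [rng.gauss(mean_ability, 1.0) for _ in range(n_persons)]
    if distribution == "normal":
        difficulties = [rng.gauss(0.0, 1.0) for _ in range(n_items)]
    elif distribution == "uniform":
        difficulties = [rng.uniform(-2.0, 2.0) for _ in range(n_items)]
    else:
        raise ValueError("distribution must be 'normal' or 'uniform'")

    data = []
    for beta in abilities:
        row = []
        for delta in difficulties:
            # Rasch probability of a correct response.
            p = 1.0 / (1.0 + math.exp(-(beta - delta)))
            correct = rng.random() < p
            # Hypothetical guessing step for otherwise incorrect responses.
            if not correct and rng.random() < guess_prob:
                correct = True
            row.append(int(correct))
        data.append(row)
    return data


# One simulated administration: 25 persons, 50 items, 25% guessing.
responses = simulate_responses(25, 50, guess_prob=0.25, distribution="uniform")
item_p_values = [sum(col) / len(responses) for col in zip(*responses)]
```

Fixing the seed makes each condition reproducible, which matters when every condition is simulated only once.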

Research Questions

Although Rasch estimation models possess the desirable characteristics or properties of estimators, what effects do measurement disturbances have on Rasch fit statistics and the LRI when varying test lengths, sample sizes, item difficulty distributions, and levels of guessing as a measurement disturbance? To test the effects of guessing and varying test parameters on Rasch fit statistics and the LRI, the following research question is addressed:

1. What effect do medium and high levels of guessing have on Rasch fit statistics and the LRI when varying sample sizes, test lengths, and item difficulty distributions?

Definition of Terms

Chi-square ($\chi^2$): a descriptive measure of the magnitude of the discrepancies between the observed and expected frequencies.

Consistency: an estimate more closely approximates the population parameter as the sample size increases.

Estimator: a statistic used to determine some characteristic about a population or sample.

Efficiency: the sampling error associated with a given estimator is less than the sampling error associated with any other method of estimation.

Fit: the degree to which measurement data approximate the assumptions or characteristics of a particular measurement model.

Latent Trait: an ability or characteristic possessed by an individual that cannot be directly observed.

Infit: the weighted fit statistic.

Logit: a unit of measurement used in the Rasch model.

Logit Residual Index: the sum of the chi-squares for an individual item. Provides an indication of variations (steepness or flatness) in the item characteristic curve.

Measurement: the assignment of numerals to objects or events according to rules.

Measurement Disturbances: conditions that interfere with the measurement of some underlying psychological construct (aptitude, ability, or attitude).

Outfit: the unweighted fit statistic.

Overfit: items with negative fit statistics and steeper observed item characteristic curves than predicted.

Parameters: characteristics about a population.

Person ability: the amount of a specific trait possessed by an individual that enables that person to answer a test question correctly.

Plodding: to work slowly on a test and run out of time before attempting all items.

Point biserial correlation: the correlation between a continuous variable and a dichotomous variable (correlation between an item score and the test score).

Population: all members in an entire group having some common characteristic.

PROX: a normal approximation estimation procedure that expresses item difficulty calibrations and person ability measures on a common linear scale.

Start-up: reduced performances at the beginning of a test due to unfamiliarity, anxiety, and so on.

Statistics: characteristics about a sample.

Unbiased: a statistic shows no systematic tendency to be either greater than or less than the population parameter.

Underfit: items with positive fit statistics and flatter observed item characteristic curves than predicted.

Delimitations

For this study, specific parameters related to test lengths, sample sizes (examinees), and item difficulty distributions were selected based on a review of previous research. The study is limited to three sample sizes (25, 50, and 100), three test lengths (25, 50, and 100), three levels of guessing (no guessing [0%], 25%, and 50%), and two item difficulty distributions (normal and uniform). In addition, the results are based on experimental conditions that simulate the single administration of four option per item multiple choice tests as experienced in classroom situations.
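The factorial design described above can be enumerated directly. The sketch below is illustrative only: the dictionary keys and the ordering of the conditions are assumptions and need not match the experiment numbering used in the study.

```python
from itertools import product

SAMPLE_SIZES = (25, 50, 100)
TEST_LENGTHS = (25, 50, 100)
GUESSING_LEVELS = (0.0, 0.25, 0.50)
DISTRIBUTIONS = ("normal", "uniform")

# Fully crossed design: 2 distributions x 3 guessing levels
# x 3 test lengths x 3 sample sizes.
conditions = [
    {
        "distribution": dist,
        "guessing": guess,
        "test_length": length,
        "sample_size": size,
    }
    for dist, guess, length, size in product(
        DISTRIBUTIONS, GUESSING_LEVELS, TEST_LENGTHS, SAMPLE_SIZES
    )
]

print(len(conditions))  # 54 experimental conditions
```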

CHAPTER 2

REVIEW OF THE LITERATURE

Historical Perspective

"Thorndike (1918) said, whatever exists at all exists in some amount. To know it thoroughly involves knowing its quantity as well as its quality" (Crocker & Algina, 1986, p. 4). Psychological constructs, however, are hypothetical abstractions that can be observed only indirectly and the existence of which can never be fully confirmed. Stevens (1946) defined measurement as the assignment of numerals to objects or events according to rules; hence, the use of nominal, ordinal, interval, and ratio scales of measurement. Lord and Novick (1968) and Torgerson (1958) noted that measurement applied to the properties of objects rather than the objects themselves. Accordingly, when we measure an individual or object, we are measuring not the object or person, but rather the properties that define the construct or variable possessed by the person whose performance is being measured. Thus, the measurement of such abstractions presents several problems (Crocker & Algina, 1986).

To empirically investigate the existence of a trait or property, it is necessary to develop a test theory to guide the investigation. Based upon test theory, we develop tests, the primary tools by which we collect information about individual differences. However, before any measurement can be made, an operational definition of the variable

of interest must be established. In other words, we must establish some correspondence between the test items and the construct being measured. This correspondence is known as establishing an operational definition. In the literature, this is sometimes referred to as item-objective congruence (Crocker & Algina, 1986). The operational definition or common line of inquiry allows the test to define the variable being measured and provides a means for estimating the location of the person taking the test along an ability continuum based on his or her test score (Wright & Stone, 1979). Test scores are meaningful only if (a) they relate to some scale of measurement; (b) they are generalizable beyond the test; and (c) the response pattern is consistent with expectations.

Binet, Thurstone, Thorndike, Stevens, and others were among the first to develop scales of measurement (Crocker & Algina, 1986). These scales provided rules and meaningful units of measurement for the comparison of individual items and persons in the assessment of individual differences. These scales are predominantly used in the assessment of individual differences in the classical true score approach to measurement.

In the classical true score approach, a person's observed score ($X$) is the sum of two unobservable scores, a true score ($T$) and an error score ($E$). The observed score is defined as $X = T + E$. The true score is defined as the average score resulting from an infinite number of repeated testings with the same instrument. The error score is the difference between the observed score and the true score. Measurements obtained using this approach are

usually expressed as correlations, percentile ranks, z-scores, t-scores, or scaled scores. It has also been shown that these statistics are sample specific. The disadvantages of the classical true score approach are that a single error term (standard error of measurement) is used for all examinees and that the item difficulties are related to the number of persons who answered the items correctly. It has also been shown that, as the reference group changes, so does the measured performance of the person taking the test. To what degree of certainty, then, are these measurements valid for generalizations beyond the reference sample?

What was needed was a test theory or model that allowed for the separate and independent estimation of both item and person parameters, something the classical true score approach did not take into account. One such model was introduced by Rasch in 1960 along with a true linear scale of measurement (Rasch, 1980). When using the Rasch model, generalizations beyond the test are based on several assumptions:

1. The test theory used in the development of the test is appropriate.
2. The items on the test define the variable being measured.
3. The test score gives us some indication of the properties that define the variable that is possessed by the person taking the test.
4. The scale of measurement is linear.
5. The item and person parameters are independent.

Thus, when the Rasch model is fitted properly, the criteria of independence of sample and items are satisfied, and generalizations beyond the test can be made.

Measurement Disturbances

Measurement disturbances are conditions that interfere with the measurement of some underlying psychological construct, such as aptitude, ability, or attitude (Smith, 1991b). With respect to the Rasch model, only two conditions determine the outcome of the interaction between the person and any item on the test: (a) the amount of the trait possessed by the person and (b) the amount of the trait necessary to provide a certain response to a given stimulus (Smith, 1991b). These conditions are commonly referred to as person ability and item difficulty. Any other condition that influences measurement is considered noise in the measurement process.

Measurement disturbances that are characteristics of the person and independent of the items include, but are not limited to, (a) start-up, (b) plodding, (c) cheating, (d) illness, (e) external distractions, (f) guessing, (g) boredom, and (h) fatigue (Smith, 1991b). Measurement disturbances associated with the interaction of the person and the properties of the items are (a) guessing, (b) sloppiness, (c) item content, (d) item type, and (e) item bias. Examples of the third type of measurement disturbance may include such things as typographical errors, unrelated answer choices, and items unrelated to content. The most common types of measurement disturbances are cheating and guessing (Smith, 1991b).

Thorndike (1949) developed a list of possible disturbances to the measurement process. Smith (1985) later classified measurement disturbances into three general categories: (a) disturbances that are the results of characteristics of the person that are

independent of the items, (b) disturbances that are the interaction between the characteristics of the person and the properties of the items, and (c) disturbances that are the results of the properties of the items that are independent of the characteristics of the person. The classification of measurement disturbances is important in that the source of a measurement disturbance dictates the techniques necessary to detect its presence (Smith, 1991b).

Glaser (1949, 1952) and Mosier (1941) felt that a person would exhibit consistently correct answers to relatively easy items, consistently incorrect responses to difficult items, and inconsistent responses to items centered on that person's ability level. Since inconsistent responses could be associated with measurement disturbances, Thurstone and Chave (1929) believed that some criterion should be established such that inconsistent records should be eliminated from the tabulation. Thus, persons with perfect scores (all correct) and persons with no items correct (score of zero) are eliminated from Rasch item and person analysis.

The detection of measurement disturbances can be divided into two general categories: an investigation of the structure of the entire response matrix (an investigation of the fit of the responses to individual items [item fit]), and an investigation of the fit of the responses for an individual person (person fit) (Smith, 1991b). Once a measurement disturbance has been detected, there are four possible responses: (a) ignore the problem, (b) assume everyone has the problem and make a correction, (c) use some method of robust estimation, or (d) use the available information about the items and the people to

make a systematic analysis of each individual's response patterns. If a measurement disturbance is noticed with person analysis, there are four possible alternative actions: (a) accept the original estimate, (b) modify the response pattern and re-estimate ability, (c) report only subset ability estimates and no total ability, or (d) decide that there is not enough information in the responses to report any ability estimate (Smith, 1991b).

An analysis of fit for the entire response matrix does not require additional information about the items or persons. It can be based solely on the observed responses, but it is more useful when based on some characteristic of the persons (age, gender, native language, or ethnic origin). These characteristics can be used to create subgroups of persons that can be used to test the invariance of the item difficulty parameters. Person fit analysis is more useful when based on groups of items that have the potential to evoke measurement noise in certain groups of persons. However, there are some measurement disturbances that cannot be easily identified in either items or persons (Smith, 1991b).

In 1982 Smith compared the weighted and unweighted between fit statistics as applied to persons and found that (a) the mean and the standard deviation of the two statistics were almost identical; (b) the correlation between the two statistics was very high (r = .99); (c) the Type I error rates were almost identical; and (d) there was high correspondence between items and persons identified as misfitting by the two statistics (Smith, 1991b). Anderson (1973), Gustafsson (1980), and Wollenberg (1982) suggested the use of the likelihood ratio chi-square test as an alternative to the between fit statistic because the

distributional properties of the Pearson chi-square are not known. Smith and Hedges (1982) demonstrated that the distributions of the Pearson chi-square and the likelihood ratio chi-square were almost identical in data simulated to fit the Rasch model and data designed to simulate various forms of measurement disturbances.

Smith (1991a) examined the effects of test length, number of persons, item difficulty distributions, person abilities, and the number of steps in each item on the fit mean squares. The results suggested that (a) the fact that responses are discrete rather than continuous variables has little influence on the distribution of the fit statistics; and (b) estimated item and person parameters appear to have little effect on the mean of the fit statistics, but seem to reduce the standard deviation. Further simulations by Smith (1991a) studied the effects of test length, number of persons, range of item difficulties, and offset between person ability and item difficulty distributions. The results suggested correction factors for the restrictions imposed by the use of both estimated item difficulties and person abilities in the fit analysis. Because the weighted and unweighted versions of the statistics differed in magnitude, two correction factors were proposed.

Smith (1988b) performed several simulations to assess the distributional properties of the weighted and unweighted between fit statistics. These simulations involved 10 replications of 1,000 persons taking a 20-item test, with the item difficulties uniformly distributed from -1 to +1 logits. The results showed that, as the number of ability groups increased, the mean and standard deviation of the transformed values

approached the hypothesized values of 0 and 1, respectively. Additional simulations studied the effect of increasing the number of persons and number of items, varying the dispersion of item difficulties, and varying the offset between the mean of the item and the person distributions. The results indicated that, within the ranges studied, varying these factors had little effect on the distribution of the transformed values. Thus, there appears to be no reason to develop correction factors such as those developed for the weighted and unweighted total fit statistics to correct for the influence of these factors on the distribution and Type I error rate of the between fit statistics (Smith, 1991a).

Smith (1988a, 1991a) also studied the power of the total fit and between fit statistics to detect two types of measurement disturbances, item bias and guessing, when unknown. These studies found that the weighted total fit, unweighted total fit, and between fit statistics were capable of detecting different types of measurement disturbances. The between fit statistic was more efficient at detecting item bias than either the unweighted or weighted total fit statistic. The unweighted and weighted total fit statistics were more sensitive to disturbances such as guessing and start-up. The primary difference between the two statistics is that the unweighted version is based on the sum of the squared standardized residuals, whereas the weighted version is based on the sum of the squared residuals weighted by the information function. For items far away from the person's estimated ability, the weighting process makes the weighted statistic less sensitive to residuals from those items. Systematic identification and evaluation of measurement disturbances were also demonstrated by Wright (1977), Wright and Stone

(1979), and Wright and Masters (1982). Unless one is looking for a specific type of measurement disturbance, it seems necessary to use both the total fit and between fit statistics in the analysis of fit information.

Smith (1986, 1988a, 1991b) and Smith and Hedges (1982) studied the power of the total fit and between fit statistics to detect two types of measurement disturbances, item bias and guessing. These studies found that the total fit and between fit statistics were capable of detecting different types of measurement disturbances. The between fit statistic was more efficient at detecting item bias, and the total fit statistics were more sensitive in detecting disturbances such as guessing and start-up.

Rasch Fit Statistics

Prior to the development of computers, the calculation of Rasch fit indexes was not practical. In fact, the first fit statistic developed for use with Rasch calibration computer programs was the overall chi-square statistic (Smith, 1991b). This statistic was based on the Pearson chi-square typically used in the fit statistics developed by Wright (1977). The overall chi-square statistic was developed to be used with dichotomously scored test items to assess the fit of the entire data matrix to the Rasch measurement model rather than assessing the fit of individual items or persons (Smith, 1991b). The overall chi-square

$$\chi^2 = \sum_{i=1}^{L} \sum_{j=1}^{m} y_{ij}^2$$

is formed by summing a version of the squared standardized residual for the entire matrix, where $m$ is the number of raw score groups of persons ($L - 1$) and $L$ is the number of items on the test, with $(L - 1)(m - 1)$ degrees of freedom (Smith, 1991b; Wright & Panchapakesan, 1969). The standardized residual is defined as

$$y_{ij} = \frac{a_{ij} - r_j p_{ij}}{\sqrt{r_j p_{ij}(1 - p_{ij})}}$$

where $a_{ij}$ is the observed number of correct responses to item $i$ for persons with a raw score $j$, $r_j$ is the number of persons with raw score $j$, and $p_{ij}$ is the probability of a correct response on that item for group $j$ (Smith, 1991b, p. 165).

Anderson (1973) developed an additional index of overall fit based on the likelihood-ratio chi-square. Wright and Panchapakesan (1969) also developed a statistic known as the item fit chi-square, which can be used to test the fit of responses to individual items. These statistics are referred to as between ability group fit statistics. Traditionally, the point biserial correlation offered a rough estimate of item fit; however, this statistic is sample specific. That is, it is dependent upon the score distribution of the sample. Anderson (1973) and Barndorff-Nielsen (1978) have shown that only item difficulty is necessary for consistent and sufficient estimates.

Rasch suggested several statistics to assess the fit of data to his measurement models, among them the weighted and the unweighted total fit statistics. The weighted statistic is referred to as infit and the unweighted statistic is referred to as outfit. In the weighted version, a greater weight is given to unexpected responses to

items near the person's logit measure (ability), and in the unweighted version, a greater weight is given to unexpected responses that are farther away from the person's logit measure (Wright & Stone, 1979). The item fit statistics evaluate the general agreement between the variable defined by the item and the variable defined by all other items over the whole sample. The weighted statistic was developed to diminish the effect of anomalous outliers, which result in unusually large mean squares (Smith, 1991b). This is evident when an unexpected number of low-ability persons answer difficult items correctly and an unexpected number of high-ability persons answer easy items incorrectly at the beginning of the test. These statistics are sensitive to measurement disturbances, such as guessing, start-up, highly discriminating items, and very easy items, but are relatively insensitive to systematic disturbances such as item bias (Smith, 1991b).

BICAL, a computer program used to test fit, was developed by Wright and Mead (1978). This program uses the unweighted versions of two statistics, the total fit and between fit statistics. The between fit statistic is based on the number of ability groups, and the total fit statistic is based on the item/person residual rather than the item/ability level residual (Smith, 1991b). Later versions of BICAL introduced a log transformation in an attempt to standardize the fit statistics to an approximate unit normal distribution. These transformations were introduced because the mean squares that indicated possible misfit varied from item to item and analysis to analysis, depending on

the number of persons, item difficulty distributions, and the distribution of person abilities (Smith, 1991b). The latest version of BICAL uses a cube root transformation to convert the mean squares of the unweighted total fit and between fit statistics into approximate unit normals (Smith, 1991b). However, these statistics are sensitive to start-up, guessing, large ranges of item difficulties, person abilities, and easy items, thereby producing large mean squares (misfit). This latest version also introduces the weighted version of the total fit statistic, which replaced the unweighted version.

Wright and Masters (1982) expanded the notion of fit to two polychotomous Rasch models, the rating scale and partial credit models. With this addition, Rasch fit statistics are now available for models with other than dichotomously scored items.

MSCALE, CREDIT, BIGSCALE, and BIGSTEPS are among the most recent Rasch calibration programs. The primary purpose of these programs is to estimate item and person parameters from a collection of responses to the items (Smith, 1991b). These programs contain both the unweighted and weighted total fit statistics. These two statistics accentuate different parts of the item-person relationship. Although there is a high correlation between the two statistics, the difference between the two can help diagnose different types of measurement disturbances. The total fit statistics are more sensitive to measurement disturbances such as guessing, where unusual numbers of low-ability examinees give correct answers to difficult items, and start-up, where an unusually high number of high-ability examinees give incorrect responses to easy items

at the beginning of the test. The total fit statistics are also sensitive to variation in the item characteristic curve.

IPARM (item and person analysis with the Rasch model) is an item analysis and person analysis computer program for dichotomous and rating scale data. The major advantage of IPARM is that it constructs between fit statistics based on characteristics of the person for item analysis or properties of the items for person analysis. It is the only software program that provides between fit statistics (unweighted version) for biographical subpopulations (Smith, 1991b). When biographical data are used in the analysis, the result is a direct test of the invariance of the estimation of the item difficulty parameter over ability groups (Smith, 1991b). When demographic data are used in the analysis to create subgroups (sex, race, and age), the resulting fit statistic will give an indication of the presence of item bias, or differential familiarity, in response patterns for the items (Smith, 1991b).

In item analysis with IPARM, the software first calculates the mean squares associated with the Rasch fit statistics and then converts them to their associated fit statistic with a cube root transformation. The unweighted mean square ($UMS_i$) is defined as

$$UMS_i = \frac{1}{N} \sum_{n=1}^{N} \frac{(X_n - P_n)^2}{P_n(1 - P_n)}$$

where $N$ is the number of people, $X_n$ is the observed response, and $P_n$ is the response predicted from the logit difficulty of the item and the logit ability of the person (Smith,

1991b). In the Rasch model, the probability of a correct response $X_{vi}$ by person $v$ with ability $\beta_v$ to item $i$ with difficulty $\delta_i$ can be found as

$$P\{X_{vi} = 1 \mid \beta_v, \delta_i\} = \frac{\exp(\beta_v - \delta_i)}{1 + \exp(\beta_v - \delta_i)}$$

(Wright & Stone, 1979), or

$$p_{ij} = \frac{\exp(b_j - d_i)}{1 + \exp(b_j - d_i)}$$

where $b_j$ is the ability measure for persons in score group $j$ (Smith, 1991b, p. 153). Although these formulas appear to be considerably different, they yield the same results provided the person's ability measure is the same. The standard deviation of the unweighted mean squares can be found as

$$S[MS(UT)_i] = \frac{\left[\sum_{n=1}^{N} \frac{1}{W_n} - 4N\right]^{1/2}}{N}$$

The weighted mean square ($WMS_i$) can be calculated as

$$WMS_i = \frac{\sum_{n=1}^{N} (X_n - P_n)^2}{\sum_{n=1}^{N} W_n}$$

where $W$, the weighting function, can be calculated as

$$W_n = P_n(1 - P_n)$$

with a standard deviation of

$$S[MS(WT)_i] = \frac{\left[\sum_{n=1}^{N} W_n - 4\sum_{n=1}^{N} W_n^2\right]^{1/2}}{\sum_{n=1}^{N} W_n}$$

(Smith, 1991b). The unweighted between mean square ($UBMS_i$) is defined as

$$UBMS_i = \frac{1}{J - 1} \sum_{j=1}^{J} \frac{\left(\sum_{n=1}^{N_j} X_n - \sum_{n=1}^{N_j} P_n\right)^2}{\sum_{n=1}^{N_j} P_n(1 - P_n)}$$

where $J$ is the number of score groups, $N_j$ is the number of persons in each score group, $X_n$ is the observed response for person $n$, and $P_n$ is the predicted response for person $n$ (Smith, 1991b, p. 32). The unweighted between standard deviation can be approximated by

$$S[MS(UB)_i] \approx \left[\frac{2}{J - 1}\right]^{1/2}$$

Once calculated, the mean squares are converted to unit normal statistics by the following cube root transformation formula:

$$t = \left(V^{1/3} - 1\right)\left(\frac{3}{S}\right) + \frac{S}{3}$$

where $V$, the mean square, and $S$, the standard deviation, are the values associated with the mean square under consideration (Smith, 1991b; Wright & Masters, 1982). The resulting statistics have an expected mean of 0 and an expected standard deviation of 1.
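The mean squares and the cube root transformation described above can be illustrated with a short sketch. It is not taken from IPARM or any other calibration program; the function names (`rasch_prob`, `cube_root_t`, `item_fit`) are hypothetical, the person abilities and item difficulty are treated as known logit values rather than estimates, and the transformation is assumed to take the Wright and Masters (1982) form $t = (V^{1/3} - 1)(3/S) + S/3$.

```python
import math


def rasch_prob(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def cube_root_t(mean_square, sd):
    """Cube root transformation of a mean square V with standard deviation S.

    Assumed form (Wright & Masters, 1982): t = (V**(1/3) - 1) * (3 / S) + S / 3.
    """
    return (mean_square ** (1.0 / 3.0) - 1.0) * (3.0 / sd) + sd / 3.0


def item_fit(responses, abilities, difficulty):
    """Return (UMS, WMS, t_unweighted, t_weighted) for a single item.

    responses : 0/1 scores, one per person
    abilities : logit ability measures, one per person (assumed known)
    difficulty: logit difficulty of the item (assumed known)
    """
    n = len(responses)
    p = [rasch_prob(b, difficulty) for b in abilities]
    w = [pn * (1.0 - pn) for pn in p]                    # weights W = P(1 - P)
    sq = [(x - pn) ** 2 for x, pn in zip(responses, p)]  # squared residuals

    # Unweighted mean square: average squared standardized residual.
    ums = sum(s / wn for s, wn in zip(sq, w)) / n
    # Weighted mean square: squared residuals divided by the summed weights.
    wms = sum(sq) / sum(w)

    # Standard deviations of the two mean squares. Both are zero when every
    # P equals 0.5, in which case the transformation is undefined.
    sd_u = math.sqrt(sum(1.0 / wn for wn in w) - 4.0 * n) / n
    sd_w = math.sqrt(sum(w) - 4.0 * sum(wn ** 2 for wn in w)) / sum(w)

    return ums, wms, cube_root_t(ums, sd_u), cube_root_t(wms, sd_w)
```

As a usage example, consider three persons at -2, 0, and +2 logits who answer an item of difficulty +1 with responses 1, 0, 1, so that the lowest-ability person succeeds unexpectedly. Calling `item_fit([1, 0, 1], [-2.0, 0.0, 2.0], 1.0)` gives an unweighted mean square of about 6.9 and a weighted mean square of about 2.4, which is in line with the unweighted statistic reacting more strongly to unexpected responses far from a person's ability.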

Logit Residual Index

The Logit Residual Index (LRI) provides a reference as to the flatness or steepness of the item characteristic curve (ICC). It indicates the linear trend of the residuals summed over persons for each item. The LRI can be calculated as

$$LRI_i = \frac{\sum_{n=1}^{N} (Y_{ni} - \bar{Y}_{\cdot i})(b_n - d_i)}{\sum_{n=1}^{N} (b_n - d_i)^2}$$

where $d_i$ is the difficulty of the item, $b_n$ is the ability of the person, $N$ is the number of persons, $X_{ni}$ is the observed response, and $P_{ni}$ is the predicted response, with the standardized residual $Y_{ni}$ and its mean over persons given by

$$Y_{ni} = \frac{X_{ni} - P_{ni}}{\sqrt{P_{ni}(1 - P_{ni})}}, \qquad \bar{Y}_{\cdot i} = \frac{1}{N}\sum_{n=1}^{N} Y_{ni}$$

(Smith, 1991b, p. 30). As one of the output variables from IPARM, the LRI is a measure of how far an item deviates from the common slope that is fitted for all items. The index has an expected value of zero (Smith, 1991b). That is, an item with an ICC that fits the modeled common curve will have an LRI value of zero. Therefore, an item with an LRI value greater than zero will have an ICC that is steeper than the modeled curve, and items with an LRI value less than zero will have an ICC that is flatter than the modeled curve. (Note: In the traditional classical true score approach, the point biserial correlation is used to provide an estimate of the slope of the observed item characteristic curve. However,

the point biserial correlation has been found to be sample specific, and no discrete values have been established to provide an accurate interpretation of the correlation coefficient as related to the slope of the item characteristic curve).

IPARM automatically assigns group membership (ability groups) based on the performance of each person on each item. The program attempts to place an equal number of persons in each ability group. Negative LRI values indicate that low-ability groups should have positive residuals and high-ability groups should have negative residuals. This indicates that low-ability persons performed better than expected and high-ability persons performed less well than expected. Positive LRI values indicate that low-ability groups should have negative residuals and high-ability groups should have positive residuals. This indicates that high-ability persons performed better than expected and low-ability persons performed less well than expected. Items with negative fit statistics tend to have steeper observed ICCs than predicted, indicating an overfit to the model, and items with positive fit statistics tend to have flatter observed ICCs than predicted, indicating an underfit to the model.

Purpose of Study

Smith (1991b) proposed that fit statistics in Rasch calibration programs provide a frame of reference for judging performance and that one way of establishing this frame of reference is to simulate data that fit the Rasch model over a variety of test conditions. The present Monte Carlo study investigated the effects of varying item difficulty distributions, number of persons, number of items, and levels of
