Using collateral information in the estimation of sub-scores --- a fully Bayesian approach

University of Iowa
Iowa Research Online: Theses and Dissertations, Summer 2009

Using collateral information in the estimation of sub-scores --- a fully Bayesian approach

Shuqin Tao, University of Iowa
Copyright 2009 Shuqin Tao

This dissertation is available at Iowa Research Online.

Recommended Citation:
Tao, Shuqin. "Using collateral information in the estimation of sub-scores --- a fully Bayesian approach." PhD (Doctor of Philosophy) thesis, University of Iowa, 2009.

Part of the Educational Psychology Commons.

USING COLLATERAL INFORMATION IN THE ESTIMATION OF SUB-SCORES --- A FULLY BAYESIAN APPROACH

by
Shuqin Tao

An Abstract

Of a thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Psychological and Quantitative Foundations (Educational Measurement and Statistics) in the Graduate College of The University of Iowa

July 2009

Thesis Supervisor: Professor Walter P. Vispoel

ABSTRACT

Educators and administrators often use sub-scores derived from state accountability assessments to diagnose learning/instruction and inform curriculum planning. However, there are several psychometric limitations of observed sub-scores, two of which were the focus of the present study: (1) limited reliabilities due to short test lengths, and (2) little distinct information in sub-scores for most existing assessments. The present study was conducted to evaluate the extent to which these limitations might be overcome by incorporating collateral information into sub-score estimation. The three sources of collateral information under investigation included (1) information from other sub-scores, (2) the schools that students attended, and (3) school-level scores on the same test taken by previous cohorts of students in each school. Kelley's and Shin's methods were implemented in a fully Bayesian framework and were adapted to incorporate differing levels of collateral information. Results were evaluated in light of three comparison criteria: signal/noise ratio, standard error of estimate, and sub-score separation index. The data came from state accountability assessments.

Consistent with the literature, using information from other sub-scores produced sub-scores with enhanced precision but reduced profile variability. This finding suggests that using collateral information internal to the test can enhance sub-score reliability, but at the expense of losing the distinctness of each individual sub-score. Using information indicating the schools that students attended led to a small gain in sub-score precision without losing sub-score distinctness. Furthermore, using such information was found to have the potential to improve sub-score validity by addressing Simpson's paradox when sub-score correlations were not invariant across schools. Using previous-year school-level sub-score information was found to have the potential to enhance both precision and distinctness for school-level sub-scores, although not for student-level sub-scores.

School-level sub-scores were found to exhibit satisfactory psychometric properties and thus have value in evaluating school curricular effectiveness. Issues concerning the validity, interpretability, and suitability of using such collateral information are discussed in the context of state accountability assessments.

Abstract Approved:
Thesis Supervisor
Title and Department
Date

USING COLLATERAL INFORMATION IN THE ESTIMATION OF SUB-SCORES --- A FULLY BAYESIAN APPROACH

by
Shuqin Tao

A thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Psychological and Quantitative Foundations (Educational Measurement and Statistics) in the Graduate College of The University of Iowa

July 2009

Thesis Supervisor: Professor Walter P. Vispoel

Copyright by
SHUQIN TAO
2009
All Rights Reserved

Graduate College
The University of Iowa
Iowa City, Iowa

CERTIFICATE OF APPROVAL

PH.D. THESIS

This is to certify that the Ph.D. thesis of Shuqin Tao has been approved by the Examining Committee for the thesis requirement for the Doctor of Philosophy degree in Psychological and Quantitative Foundations (Educational Measurement and Statistics) at the July 2009 graduation.

Thesis Committee:
Walter P. Vispoel, Thesis Supervisor
Timothy N. Ansley
Won-Chan Lee
Catherine J. Welch
Mary Kathryn Cowles

To Johnny, Mom and Alina

ACKNOWLEDGMENTS

I would first like to thank Dr. Walter P. Vispoel, my advisor and dissertation supervisor, for supporting me both financially and academically throughout my study in Iowa. The experience of working as a research assistant for him has benefitted me in many ways as I moved forward with my career. I would especially like to thank Dr. Won-Chan Lee for the dedication and accommodation that he has shown while serving on my dissertation committee; his expertise and insights on test theory have also greatly contributed to this dissertation. Special thanks go to the other committee members for their time and contribution to this dissertation: to Dr. Cowles for helping me improve the Bayesian models, to Dr. Ansley for contributing to my literature review, and to Dr. Welch for prompting me to think more deeply about issues related to score reporting and implementation.

I am very thankful to Iowa Testing Programs for providing an excellent environment for studying measurement and statistics. I would like to say a special thank-you to professors who are not on my committee but have taught me valuable knowledge in psychometrics: Dr. Robert Brennan, Dr. Michael Kolen, Dr. David Frisbie, and Dr. Steve Dunbar. A heartfelt thank-you also goes to Mrs. Anna Marie Guengerich for her kind smile, her warm heart, and her helping hand, which have made the Blommers library not only a wonderful place to study but also a warm place to hang around.

I would also like to thank my employer, Data Recognition Corporation, for being accommodating and flexible with me, which has made it possible for me to accomplish this dissertation while working at a full-time job and expecting a baby. During this process, my supervisor, Dr. Yi Du, and my close teammate, Mrs. Christie Plackner, have given me strong support as both coworkers and friends. Their patient guidance, their accommodation in work time, and their friendship have truly made it easier for me to handle work and dissertation simultaneously.

To my mom, I owe the deepest debt of gratitude. Without her endless love and devotion, I would never have been who I am or accomplished what I have accomplished today. Perseverance and a strong will to fight through adversity are just a few qualities that she has taught me through her life examples. To Johnny, my dear husband, no words can ever express how I appreciate and cherish his presence in my life. If I had gained only his love in coming to the United States, crossing thousands of miles would have been well worth it. His love is amazingly multidimensional: it can be as subtle as preparing a delightful raspberry yogurt breakfast, or as enlightening as introducing me to the wonders of the Bayesian world. During the process of doing my dissertation, he served as a discussant, a critic, a resource finder, a counselor, a food caterer, an entertainer, a distracter, and most importantly, a loving husband and a soul mate. Last but not least, I would like to thank my Lord, Creator of heaven and earth, for working miracles in my life; from Whom all good things come.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER I. INTRODUCTION
  1.1 The Dilemma of Sub-score Reporting
  1.2 Problems with Reporting Observed Sub-scores
  1.3 Problems with the Current Augmented Sub-scores
  1.4 Orientation of the Study

CHAPTER II. REVIEW OF LITERATURE
  2.1 Empirical Bayes Estimation
    2.1.1 Kelley's Regressed Score Estimation Method
    2.1.2 Wainer's Multivariate Regressed Score Estimation Method
  2.2 Fully Bayesian Estimation
    2.2.1 Overview of Bayesian Inference
    2.2.2 Empirical Bayes vs. Full Bayes
    2.2.3 Fully Bayesian Formulation of Kelley's Method
    2.2.4 Fully Bayesian Formulation of Wainer et al.'s Method
  2.3 Comparison of Wainer et al.'s Multivariate Empirical Bayes and Shin's Multivariate Fully Bayesian Methods
  2.4 Identifiability Issues for the Fully Bayesian Methods
  2.5 Other Sub-score Augmentation Methods
  2.6 Studies on Wainer et al.'s Method and Shin's Method
  2.7 Do Sub-scores Really Have Little Value?
  2.8 Using Collateral Information to Enhance Precision
  2.9 Implications for the Present Study

CHAPTER III. DESIGN OF STUDY
  Data
  Sample
  Current Score Reporting
  Model Specification and Operationalization
    Model A
    Model B
    Model C
    Model D
    Model E
    Model F
  Prior Specification
  Graphical Representation of Models
  Comparison Criteria
    Signal/Noise Ratio
    Standard Error of Estimate
    Sub-score Separation Index
  Analysis Procedures
  Issues in Implementing MCMC Methods
  Summary

CHAPTER IV. RESULTS
  Model Comparison for ELA
  Model Comparison for Math

CHAPTER V. DISCUSSION AND CONCLUSION
  Discussion of Research Questions
    Discussion for Research Question 1
      Effects of Using Information from Other Sub-scores
      Effects of Using Schools as Collateral Information
      Effects of Using Previous Year School-Level Sub-score Information
    Discussion for Research Question 2
    Discussion for Research Question 3
    Discussion for Research Question 4
  Discussion for Problems Related to Sub-scores
    Student-Level Results
    School-Level Results
  Validity, Interpretability and Suitability of Augmented Sub-scores
  Limitations and Future Research

REFERENCES

APPENDIX A. WINBUGS CODES
APPENDIX B. CONVERGENCE DIAGNOSTICS FOR MODELS A, B, D, AND E
APPENDIX C. SUB-SCORE CORRELATIONS FOR TEN SELECTED SCHOOLS FOR ELA AND MATH

LIST OF TABLES

Table 3.1 Test design characteristics for ELA
Table 3.2 Test design characteristics for mathematics
Table 3.3 Correlations and disattenuated correlations between strands for ELA
Table 3.4 Correlations and disattenuated correlations between subtests for mathematics
Table 3.5 Descriptive information for stratum variables and demographic variables for the 2007 sample
Table 3.6 Descriptive information for stratum variables and demographic variables for the 2006 sample
Table 3.7 Descriptive information for ELA and mathematics raw scores for the 2007 sample
Table 3.8 Descriptive information for ELA and mathematics raw scores for the 2006 sample
Table 4.1 Signal/noise ratios (student and school levels) for ELA
Table 4.2 Standard errors of estimate (student and school levels) for ELA
Table 4.3 Sub-score separation indices (student and school levels) for ELA
Table 4.4 Comparison of criterion ratios across models for ELA: Signal/noise ratios
Table 4.5 Comparison of criterion ratios across models for ELA: Standard errors of estimate
Table 4.6 Comparison of criterion ratios across models for ELA: Sub-score separation indices
Table 4.7 Comparison of school-level signal/noise ratios by school size for ELA
Table 4.8 Comparison of school-level standard errors of estimate by school size for ELA
Table 4.9 Comparison of school-level sub-score separation indices by school size for ELA
Table 4.10 Signal/noise ratios (student and school levels) for math
Table 4.11 Standard errors of estimate (student and school levels) for math
Table 4.12 Sub-score separation indices (student and school levels) for math
Table 4.13 Comparison of criterion ratios across models for math: Signal/noise ratios

Table 4.14 Comparison of criterion ratios across models for math: Standard errors of estimate
Table 4.15 Comparison of criterion ratios across models for math: Sub-score separation indices
Table 4.16 Comparison of school-level signal/noise ratios by school size for math
Table 4.17 Comparison of school-level standard errors of estimate by school size for math
Table 4.18 Comparison of school-level sub-score separation indices by school size for math
Table C1 Sub-score correlations for ten selected schools: ELA
Table C2 Sub-score correlations for ten selected schools: Math

LIST OF FIGURES

Figure 3.1 Graphic representation of Model A: Kelley's model
Figure 3.2 Graphic representation of Model B: Kelley's model incorporating school information
Figure 3.3 Graphic representation of Model C: Kelley's model incorporating school information and previous year school-level sub-score information
Figure 3.4 Graphic representation of Model D: Shin's model
Figure 3.5 Graphic representation of Model E: Shin's model incorporating school information
Figure 3.6 Graphic representation of Model F: Shin's model incorporating school information and previous year school-level sub-score information
Figure 3.7 Examples of MCMC sampling history plots
Figure 3.8 Examples of MCMC autocorrelation plots
Figure 3.9 Examples of BGR diagnostic plots
Figure 4.1 Student-level signal/noise ratio for ELA
Figure 4.2 Student-level standard error of estimate for ELA
Figure 4.3 Student-level sub-score separation index for ELA
Figure 4.4 School-level signal/noise ratio for ELA
Figure 4.5 School-level standard error of estimate for ELA
Figure 4.6 School-level sub-score separation index for ELA
Figure 4.7 Student-level signal/noise ratio ratio for ELA
Figure 4.8 Student-level standard error of estimate ratio for ELA
Figure 4.9 Student-level sub-score separation index ratio for ELA
Figure 4.10 Student-level signal/noise ratio for math
Figure 4.11 Student-level standard error of estimate for math
Figure 4.12 Student-level sub-score separation index for math
Figure 4.13 School-level signal/noise ratio for math
Figure 4.14 School-level standard error of estimate for math
Figure 4.15 School-level sub-score separation index for math

Figure 4.16 Student-level signal/noise ratio ratio for math
Figure 4.17 Student-level standard error of estimate ratio for math
Figure 4.18 Student-level sub-score separation index ratio for math
Figure B1 Example history plots for Model A
Figure B2 Example autocorrelation plots for Model A
Figure B3 Example Monte Carlo error for Model A
Figure B4 Example BGR plots for Model A
Figure B5 Example history plots for Model B
Figure B6 Example autocorrelation plots for Model B
Figure B7 Example Monte Carlo error for Model B
Figure B8 Example BGR plots for Model B
Figure B9 Example history plots for Model D
Figure B10 Example autocorrelation plots for Model D
Figure B11 Example Monte Carlo error for Model D
Figure B12 Example BGR plots for Model D
Figure B13 Example history plots for Model E
Figure B14 Example autocorrelation plots for Model E
Figure B15 Example Monte Carlo error for Model E
Figure B16 Example BGR plots for Model E

CHAPTER I. INTRODUCTION

The purpose of this study was to develop a series of Bayesian models and use them to examine the effects of incorporating different levels of collateral information on sub-score estimation. This chapter begins with background regarding policies for sub-score reporting and problems with current reporting approaches. This is followed by an overview of existing methods for addressing the problems. The chapter concludes with the orientation of the study and the organization of the remaining chapters.

1.1 The Dilemma of Sub-score Reporting

Under the No Child Left Behind Act of 2001 (NCLB), every state is required to develop an accountability assessment system to measure statewide progress and evaluate school performance. NCLB further requires that state accountability assessments produce individual student interpretive, descriptive, and diagnostic reports (Public Law 107-110, Section 1111(b)(3)(c)(xii)). According to Ligon (2007), this mandate reflects a prevailing practice of pressing accountability assessments to provide diagnostic information, in which psychometric soundness often takes a back seat to political considerations.

It is quite common to see an assessment serve more than one purpose in practice. Sometimes, however, the purposes may not be very compatible with one another. Accountability assessment and diagnostic assessment serve very different purposes: the former is meant to rate schools, whereas the latter is meant to help diagnose achievement and inform instruction. It has been repeatedly pointed out that these two divergent purposes result in very different optimal designs and psychometric properties of tests (Wainer, Sheehan, and Wang, 2000; Wainer, Vevea, Camacho, Reeve, Rosa, Nelson, Swygert, and Thissen, 2001). A test that is to rank individuals and schools requires broad content coverage to ensure fairness and validity, whereas a test that is to diagnose needs to focus as narrowly as possible on specific areas of content domains.

From a psychometric perspective, this conflict is self-evident; from a political perspective, less so. Since the advent of NCLB, teachers, principals, and other educational stakeholders have lobbied politicians to stretch the use of state assessments beyond the capability of well-designed accountability measures. Diagnostic assessment is what really helps teachers focus instruction on students' immediate needs. However, diagnostic assessments have been under-valued and under-funded as a result of the enormous amounts of time and money invested in accountability assessments (Ligon, 2007). At the same time, there is great pressure from the public to limit the number of assessments students take so that more time can be spent on instruction. Not surprisingly, teachers and other school practitioners respond to this pressure by demanding more information from any given assessment, including those used for state accountability. This usually takes the form of requests for information that is more detailed than an overall score in a general subject area. For example, in addition to an overall math score, scores associated with the content domains that comprise the math test, such as number sense, geometry and measures, algebra and functions, and probability and statistics, might be reported. These diagnostic scores are referred to in the literature by various names, such as objective scores, diagnostic scores, skill scores, domain scores, content standard scores, strand scores, and sub-scores. In this study, the scores of interest are hereafter called sub-scores.

If properly derived, sub-scores can provide important information to school district administrators, teachers, parents, and students. Administrators can use sub-scores to evaluate curricular effectiveness and make decisions about allocating instructional resources, setting up remedial programs, and so forth. Teachers can use sub-score profile information to help revise their teaching strategies, redesign curricula when needed, identify students who need remedial instruction, and place those students in appropriate instructional programs. Parents can use sub-score information to assess instructional effectiveness, consult teachers to identify learning problems, and work together to build a home environment conducive to learning.

Students, in turn, can use sub-scores to determine their strengths and weaknesses and use this information to refocus their learning efforts.

1.2 Problems with Reporting Observed Sub-scores

A common and intuitively appealing approach to score reporting is to provide observed scores for each content domain of interest. A review of current score reporting practices (Goodman and Hambleton, 2004) has shown that this is the approach most states adopt, with sub-scores typically reported on raw score or percent correct metrics. Unfortunately, this simplistic approach is fraught with problems. First, sub-scores, which are usually based on a small number of items, may have low reliability and thus cannot precisely measure unique abilities (Edwards and Vevea, 2006; Goodman and Hambleton, 2004; Haberman, 2005; Haberman, Sinharay, and Puhan, 2006; Monaghan, 2006; Shin, 2004; Wainer, Sheehan, and Wang, 2000; Wainer et al., 2001; Yen, 1987; Yen, 1997; Yao and Boughton, 2007). Second, for many existing assessments, there may be little distinct information in sub-scores that is not reflected in total scores (Haberman, 2005; Haberman et al., 2006; Monaghan, 2006; Sinharay, Haberman, and Puhan, 2007). Third, for most assessments, sub-scores are usually not equated or scaled to permit meaningful comparisons across forms or across domains (Goodman and Hambleton, 2004; Monaghan, 2006; Yao and Boughton, 2007). It is of paramount importance to address these three problems before sub-scores can be used appropriately. Comparatively speaking, the third problem, equating/scaling, is the easiest to solve and has been addressed in assessments such as ACT's EPAS (Educational Planning and Assessment System); specifically, this is accomplished through tight control of both technical and content specifications of the subtests. The first problem, low sub-score reliability, has attracted the most attention from researchers, and some progress has been made (Shin, 2004; Thissen and Edwards, 2005; Wainer et al., 2001; Yao and Boughton, 2007; Yen, 1987).

The second problem, sub-score specificity, reflects the root of many pitfalls associated with the use of sub-scores, but unfortunately remains largely unaddressed. Sub-score specificity and estimation precision (problems 1 and 2) were chosen to be the main focus of the present study. In the following sections of this chapter, I first discuss how the problems mentioned above are addressed by methods that exist in the literature, and then provide the orientation for the present study.

1.3 Problems with the Current Augmented Sub-scores

It is not unusual to see a state assessment composed of sub-scores based on only around ten to fifteen items each, and some sub-scores can be based on as few as five or six items. Such short lengths may be necessary in a practical sense, but they undermine reliable measurement at the sub-score level. What can be done to make sub-scores reliable enough to be reported? A variety of statistical methods have been proposed during the last two decades to enhance sub-score reliability by augmenting data from any particular subtest with information obtained from other portions of the test. Yen (1987), for example, developed an empirical Bayes estimation procedure that combines information from the responses to a particular subtest with the score on the test as a whole to produce a more reliable measure of that subtest. This procedure is essentially a weighted combination of sub-scores with the total score and forms the basis of the "objective performance index" (OPI) reported for some tests published by CTB/McGraw-Hill. More recently, Wainer et al. (2001) proposed a procedure that is slightly different from Yen's OPI procedure. It is a multivariate empirical Bayes estimation method. Instead of treating the rest of the test as a unit in sub-score estimation, Wainer et al. distinguished among the other sub-scale scores in the estimation of any one of them. Consequently, Wainer et al.'s method is more suitable than Yen's OPI procedure when the test is truly multidimensional. Shin's method (2004) is a fully Bayesian variant of Wainer et al.'s method that has been shown to perform very similarly to Wainer et al.'s method when non-informative priors are used, but to give more precise estimates when informative priors are incorporated (Shin, 2008).

Multidimensional item response theory (MIRT) approaches have been growing rapidly and hold promise for creating augmented sub-scores as well. Yao and Boughton (2007), for example, proposed a MIRT procedure to improve sub-score precision by modeling the correlations among latent trait scores. Sub-scores yielded by these augmentation methods are usually reliable enough for reporting purposes. However, the standardized augmented sub-scores were found to be virtually the same for any examinee when subtests were highly correlated (Wainer et al., 2000; Wainer et al., 2001).

This last issue reflects the problem of getting little distinct information in sub-scores that is not reflected in total scores. When augmented scores are obtained in such a context, they are essentially replications of the total score and serve little diagnostic purpose (Sinharay et al., 2007). Specifically, when measurement error is taken into account, the confidence intervals associated with sub-scores for an individual usually overlap with one another, and thus strengths and weaknesses cannot be detected. This problem reflects the nature of the test more than the methodology used to develop sub-scores. A state assessment is usually built according to a table of content specifications, but the content areas may be so highly correlated that the test scores turn out to be unidimensional. Does this mean that sub-scores always have little diagnostic value and thus are not worth reporting? A series of research efforts at Educational Testing Service (ETS: Haberman, 2005; Haberman et al., 2006; Sinharay et al., 2007) has been conducted to answer this question. Their results provided little support in favor of reporting sub-scores for the tests under investigation. This can be taken as a "No" to that question if the following three assumptions are true. The first assumption is that the value under their investigation is the same value that teachers and students are looking for. However, this assumption can be challenged.

As will be discussed in more detail in Chapter II, the value investigated in the ETS research is the value concerning inter-individual decision-making, such as ranking individuals based on sub-scores, whereas teachers and students are more interested in the value concerning intra-individual decision-making, such as comparing sub-scores and detecting strengths and weaknesses within an individual. The second assumption is that the current methods yield unbiased results. Wainer et al. (2000) acknowledged that their method could be susceptible to a phenomenon called Simpson's Paradox; that is, in all situations in which results are presented in the aggregate, there is always the chance that with some kind of disaggregation very different results might occur. Mann and Moulder (2008), for example, confirmed this suspicion with a simulation study and concluded that bias could occur if analyses were conducted with all groups combined when subtest trait correlations were not invariant across all groups. This problem is not limited to Wainer et al.'s method; it applies to all augmentation methods that use correlations to enhance the reliability of sub-scores. The last assumption is that there could be no improvement in current sub-score augmentation methods. Again, this assumption can be challenged. Mislevy (1987), for instance, has shown that additional stability and precision may be achieved by taking advantage of auxiliary or collateral information available about examinees. Statistically speaking, auxiliary or collateral information refers to any variable that is correlated with the target variable (Wang, Chen, and Cheng, 2004). It can be test-related information within the test (e.g., item responses to other subtests), test-related information external to the test (e.g., previous test scores or other test scores), or information about examinees (e.g., school attended, courses taken, grades obtained, or even family background). When such a broad perspective is adopted, it is not difficult to recognize that current sub-score augmentation methods are only a special case of using collateral information to improve score precision. The sub-score augmentation methods discussed up to this point use only information available in the item responses that belong to other subtests.

There is still plenty of other collateral information to exploit, as long as it is appropriate for the intended use of sub-scores.

1.4 Orientation of the Study

The purpose of the present study was to examine the effects of incorporating three sources of collateral information on sub-score estimation: (1) information from other sub-scores, (2) the schools that students attended, and (3) school-level sub-score information on the same test taken by previous cohorts of students in each school. As can be seen from the sections above, the first source of collateral information (i.e., other sub-scores) has dominated the current literature on sub-score augmentation. One of the reasons for its dominance is that such information is internal to the test and is usually available without extra data request efforts. Information from other sub-scores is usually incorporated using multivariate (e.g., Yen, 1997; Wainer et al., 2001; Shin, 2004) or multidimensional (e.g., Yao and Boughton, 2007) approaches. Among all the existing approaches, Shin's method (2004) was chosen for the present study mainly due to its capability to incorporate more collateral information under a fully Bayesian framework. Shin's method was adapted and implemented with Bayesian hierarchical models, in which correlations among sub-scores were modeled, each school was allowed to have its own mean and covariance structure, and previous test information was incorporated as priors.

There are also situations in which it is not appropriate or desired to incorporate information from other sub-scores, but the other two sources of collateral information can still be used. In this case, a univariate approach is called for, in which sub-scores are treated as independent of one another, and consequently no information is borrowed from other sub-scores. As detailed in Chapter II, Kelley's regressed score method (1927) is well suited for this purpose and was thereby chosen to be the baseline model to which all the subsequent models were compared. Furthermore, to incorporate the other two sources of collateral information, Kelley's method was further adapted and implemented with a Bayesian hierarchical model, in which each school was allowed to have its own mean and variance structure, and previous test information was incorporated as priors.

The present study was designed to examine the incremental effects of the three sources of collateral information. Six models were constructed: (1) Model A: Kelley's regressed score model, a non-augmented model that estimates sub-scores based only on items within the strand of interest; (2) Model B: Model A plus schools attended; (3) Model C: Model A plus both schools attended and previous year school-level sub-score information; (4) Model D: Shin's multivariate fully Bayesian model, which incorporates information from other sub-scores; (5) Model E: Model D plus schools attended; and (6) Model F: Model D plus both schools attended and previous year school-level sub-score information. The first three models (A through C) are univariate, whereas the next three (D through F) are multivariate. Also, only Models A and D are non-hierarchical, while the other four are hierarchical with schools as levels. The six models were compared in terms of three criteria: signal/noise ratio (SNR), standard error of estimate (SEE), and sub-score separation index (SSI). The Markov chain Monte Carlo (MCMC) method was used for estimation. Analyses were conducted on the standardized observed score metric; standardization was used to adjust for subscale differences in the number of items and in item difficulty levels.

Why are these three sources of collateral information of particular interest in this study? Above all, they are all readily available and require no additional data collection. Furthermore, they play important roles in addressing issues raised in the literature as well as in connecting to previous research. As mentioned in the previous section, information from other sub-scores is used by current sub-score augmentation methods, and a model using it thus serves as a multivariate baseline to which subsequent models are compared. The reason for using schools as collateral information is two-fold. First, different schools have different instructional emphases and thus are likely to have different correlations among sub-scores. This can be illustrated with an extreme case.

Suppose that a school does not teach probability at all. If this is so, the correlations between the probability strand and the other content strands are likely to be very low or even zero. However, most schools in the state teach probability, and thus the corresponding correlations are likely to be high. Results will be biased if we allow this school to borrow information from other subtests when estimating the probability sub-score. Therefore, it would be more appropriate to allow each school to use its own mean and correlation structure, which is a direct attempt to address the Simpson's Paradox concern raised by Wainer et al. (2000) and confirmed by Mann and Moulder (2008). Second, a nice by-product of disaggregating the entire population by schools is direct estimation of school-level sub-scores as well as the precision of those estimates. There is an increasing demand for sub-scores at the school level or other aggregate levels of interest. This is due to at least two reasons: (a) sub-scores at the school level can be more reliable than individual sub-scores and thus have the promise of offering diagnostic value even when individual sub-scores have little diagnostic value (Haberman, Sinharay, and Puhan, 2006; Sinharay, Haberman, and Puhan, 2007); and (b) school-level sub-scores can also be used directly to evaluate school curricular effectiveness and help pinpoint the strengths and weaknesses in school-wide instruction.
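The aggregation problem described above can be made concrete with a small simulation. The following minimal Python sketch uses entirely made-up school parameters (the correlations and mean vectors are illustrative assumptions, not estimates from this study's data) to show how pooling two schools with different within-school sub-score correlations produces a pooled correlation that matches neither school:

    import numpy as np

    rng = np.random.default_rng(7)

    def simulate_school(n, rho, mean):
        # Draw n students' (algebra, probability) sub-scores with
        # within-school correlation rho around the school's mean vector.
        cov = np.array([[1.0, rho], [rho, 1.0]])
        return rng.multivariate_normal(mean, cov, size=n)

    # School 1 teaches probability: strongly correlated strands, higher means.
    # School 2 does not: near-zero correlation and a lower probability mean.
    school1 = simulate_school(500, rho=0.8, mean=[0.5, 0.5])
    school2 = simulate_school(500, rho=0.0, mean=[-0.5, -1.0])
    pooled = np.vstack([school1, school2])

    print("school 1 r:", np.corrcoef(school1.T)[0, 1])   # about .8
    print("school 2 r:", np.corrcoef(school2.T)[0, 1])   # about .0
    print("pooled   r:", np.corrcoef(pooled.T)[0, 1])    # neither; inflated by the mean gap

An augmentation model fit to the pooled data would borrow strength across strands at a rate appropriate for neither school; allowing each school its own mean and correlation structure, as in Models B, C, E, and F, avoids this distortion.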

Using previous year test data also holds additional promise for improving sub-score estimation in state assessments. Because state assessments usually run continuously over multiple years, the previous year's test data can readily be incorporated into the current year's sub-score estimation. There are two approaches to using previous year test data. One is to use matched longitudinal data; for example, students' previous-year scores from the preceding grade can be used in estimating their sub-scores in the current grade. The other approach is to use same-grade cohort data; for example, the previous year's scores from a given grade can be used in sub-score estimation for that same grade in the current year. The former involves the same students but different content, whereas the latter involves the same content but different students. It was the latter that was espoused in this study, because cohorts could be considered reasonably stable in the context of state assessments, but content differences could be considerable across grades. Also, it is important to note that when the latter approach is adopted, only school-level prior information is available for use, because different students are involved and only the schools remain the same. This was the case with the present study.

The remaining chapters are organized as follows. Chapter II is focused on the literature on sub-score augmentation in particular and the use of collateral information in general, with an emphasis on theoretical derivations of models, empirical findings on model performance, and some practical issues raised in the literature. In Chapter III, an outline of the design of the study is provided, including the data and sample, model specification, comparison criteria, and analysis procedures, as well as the research questions. In Chapter IV, the results are presented and illustrated. In Chapter V, the research questions are each revisited, issues related to implementation are discussed, limitations are acknowledged, and future research is suggested.

CHAPTER II. REVIEW OF LITERATURE

This chapter begins with an extensive treatment of Kelley's regressed score method (1927) and Wainer et al.'s (2001) multivariate empirical Bayes estimation method, followed by their reconceptualization in a fully Bayesian framework, which corresponds to the fully Bayesian Kelley's method and Shin's method (2004), respectively. Wainer et al.'s and Shin's methods are compared under Bayesian theory, and research findings on their psychometric performance are also presented. Sections 2.1 through 2.3 are intended to give readers a picture of how Kelley's, Wainer et al.'s, and Shin's methods are related conceptually and mathematically. It is important to establish a connection between the latter two methods because there is limited research on Shin's method, and Wainer et al.'s method serves as a link between Shin's method and the literature on sub-score augmentation. In Section 2.4, I describe an unidentifiability issue that the fully Bayesian models have and discuss the literature on this topic. In Section 2.5, I describe other sub-score augmentation methods existing in the literature and compare them to Wainer et al.'s and Shin's methods. In Section 2.6, I summarize findings from sub-score augmentation research. In Section 2.7, I discuss the question, raised in the sub-score augmentation literature, of whether sub-scores have value. In Section 2.8, I bring the sub-score value discussion into a broader perspective and present an overview of the literature on the use of collateral information to enhance estimation precision. Finally, in Section 2.9, I present arguments as to why Kelley's and Shin's methods were chosen for investigation and how they could be adapted to address issues discussed in the literature.

2.1 Empirical Bayes Estimation

Wainer et al.'s (2001) multivariate empirical Bayes estimation method is also referred to as a multivariate regressed score estimation method, which is a multivariate generalization of Kelley's classic regressed estimate of the true score (Wainer et al., 2000; Wainer et al., 2001; Thissen and Edwards, 2005).

The basic notion underlying both methods is to use collateral information to increase the precision of estimates (Wainer et al., 2001). To be more specific, for Kelley's regressed score method, we regress the estimate toward some aggregate value (e.g., the group mean); in this case, the collateral information is the aggregate value. Wainer et al.'s multivariate regressed score method can be considered a multiple regression of a true sub-score on all of the observed sub-scores on a test; therefore, the collateral information in this case is both the group mean and information from other sub-scores. To put it in less statistical terms, we augment sub-scores that contain meager information by borrowing strength from other parts of the test and from the group to which the student belongs. Accordingly, the method of using collateral information to estimate sub-scores is called sub-score augmentation, and the resulting sub-scores are called augmented sub-scores. The next section begins with a brief account of Kelley's method and its roots in the empirical Bayes framework, followed by a natural extension to its multivariate counterpart, Wainer et al.'s (2001) multivariate regressed score method.

2.1.1 Kelley's Regressed Score Estimation Method

The well-known Kelley's equation (1927) is as follows:

    τ̂ = ρ x + (1 − ρ) µ,    (2.1)

in which an improved estimate of the true score (τ) is obtained by shrinking the observed score (x) toward the group mean (µ) by an amount equal to the complement of the reliability of the measurement (ρ). In practice, we substitute the sample mean (x̄) for the population mean (µ) and a sample estimate of reliability (r) for ρ, as in Equation (2.2):

    τ̂ = r x + (1 − r) x̄.    (2.2)

The use of the sample mean in place of the population mean injects the empirical portion into the Bayes estimation. As is evident from Equations (2.1) and (2.2), when the observed score is very reliable, its contribution dominates the estimate; when the observed score is very unreliable, the estimate shrinks from either extreme toward the group mean. The essence of Kelley's regressed score estimation is to retain the reliable part of the observed score and remove the unreliable part by regressing it toward the group mean. Consequently, Kelley's regressed score method improves precision by using the group mean as collateral information.
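As a concrete illustration of Equation (2.2), the short Python sketch below applies Kelley's formula to a handful of made-up raw scores (the score vector and the reliability value are hypothetical, chosen only to make the shrinkage visible):

    import numpy as np

    def kelley_estimate(x, reliability):
        # Equation (2.2): shrink each observed score toward the sample mean
        # by an amount equal to the complement of the reliability.
        x = np.asarray(x, dtype=float)
        return reliability * x + (1.0 - reliability) * x.mean()

    scores = [3, 5, 8, 10, 14]   # hypothetical short-subtest raw scores; mean = 8
    print(kelley_estimate(scores, reliability=0.55))
    # [ 5.25  6.35  8.    9.1  11.3 ]: extreme scores move markedly toward
    # the mean, while scores near the mean barely move.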

2.1.2 Wainer's Multivariate Regressed Score Estimation Method

As mentioned previously, Wainer et al.'s (2001) multivariate regressed score method is a multivariate generalization of Kelley's regressed score method. This is self-evident when we rearrange the terms in Equation (2.2) and express them in vector notation, which becomes

    τ̂ = x̄ + B(x − x̄),    (2.3)

where x and x̄ are vectors containing the several observed sub-scores and their corresponding group means, and B is a matrix that is the multivariate analog of the estimated reliability. Here bolded symbols represent vectors. The matrix B contains weights that combine the scores in x into estimates of the true scores in τ. B = S_true (S_obs)⁻¹ is conventional notation for the matrix of regression coefficients, where S_true and S_obs are the sample estimates of the true and observed covariance matrices, respectively. S_obs can be obtained from the sample directly. How can we estimate S_true? As explained in Wainer et al.'s (2001) study, the off-diagonal elements of S_true and S_obs are equal because errors are uncorrelated with true scores. It is in the diagonal elements of the two matrices that the difference arises. In the diagonal of S_true are the true score variances, whereas in the diagonal of S_obs are the observed score variances.

It is not too hard to recognize that we can obtain the diagonal elements of S_true by multiplying the diagonal elements of S_obs by the reliability of the subscale in question. To summarize, the elements of S_true can be computed with the following equations:

    s_true_jj′ = s_obs_jj′ for j ≠ j′, and
    s_true_jj = ρ_j s_obs_jj for j = j′.

It is customary to estimate the reliability ρ_j with Cronbach's coefficient α. Assume that the true score τ and the observed score x follow a multivariate normal distribution with common mean µ:

    (τ, x)′ ~ N( (µ, µ)′, [ Σ_true  Σ_true ; Σ_true  Σ_obs ] ).

Then the empirical Bayes estimate of the vector of true subscale scores for examinee p, τ_p, conditioned on the observed scores, is

    E(τ_p | x_p) = µ + Σ_true (Σ_obs)⁻¹ (x_p − µ),    (2.4)

which we estimate using

    τ̂ = x̄ + S_true (S_obs)⁻¹ (x_p − x̄) = x̄ + B (x_p − x̄),    (2.5)

by substituting the observed estimates x̄ (the vector mean of the subscale scores) and S_true and S_obs (the sample estimates of the true and observed covariance matrices) for the population quantities µ, Σ_true, and Σ_obs. Note that Equations (2.3) and (2.5) are the same, which confirms that Wainer et al.'s (2001) multivariate regressed score method is a multivariate generalization of Kelley's regressed score method. The similarities and differences between the two methods can be illustrated with different types of B. When the off-diagonal elements of B are equal to zero, which means that the sub-scores are totally independent, Wainer et al.'s method and Kelley's method work the same way. For example, if B = I, the scores are perfectly reliable and the observed scores x are the estimated true scores (τ̂ = x); if B = 0, all observed scores are regressed to the mean, x̄ (τ̂ = x̄). However, the two methods become different when the off-diagonal elements of B are not equal to zero. In this case, Equation (2.5) allows Wainer et al.'s method to borrow information from other sub-scores. We can also obtain an estimate of the conditional covariance matrix of the estimated true scores with the following equation:

    S_τ|x_p = S_true − S_true (S_obs)⁻¹ S_true.    (2.6)

This equation can be used to compute the conditional standard errors for augmented sub-scores. These regressed sub-score estimates are more accurate in the sense that they have smaller mean squared error than estimates based solely on the items within each subscale, if the rest of the test bears any relation at all to what is being measured by the items on that subscale (Wainer et al., 2001). It is evident from Equation (2.5) that the improvement in precision is achieved by using both group means and other sub-scores as collateral information.
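Equations (2.5) and (2.6) translate directly into a few lines of linear algebra. The sketch below is an illustrative implementation under stated assumptions (the function and variable names are mine, and per-subscale reliability estimates such as coefficient alpha are assumed to be supplied); it is not the scoring code used in this study:

    import numpy as np

    def augment_subscores(X, reliabilities):
        # X: (n_students, n_subscales) matrix of observed sub-scores.
        # reliabilities: one estimate per subscale, e.g., coefficient alpha.
        X = np.asarray(X, dtype=float)
        xbar = X.mean(axis=0)
        S_obs = np.cov(X, rowvar=False)
        # S_true shares its off-diagonal elements with S_obs; its diagonal
        # (true-score variances) is the observed variance times reliability.
        S_true = S_obs.copy()
        np.fill_diagonal(S_true, np.asarray(reliabilities) * np.diag(S_obs))
        B = S_true @ np.linalg.inv(S_obs)
        tau_hat = xbar + (X - xbar) @ B.T   # Equation (2.5), one row per student
        S_cond = S_true - B @ S_true        # Equation (2.6)
        return tau_hat, S_cond

The square roots of the diagonal of S_cond give the conditional standard errors for the augmented sub-scores.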

2.2 Fully Bayesian Estimation

This section begins with a brief overview of the core principles of Bayesian inference (see Bolstad, 2004; Gelman, Carlin, Stern, and Rubin, 2003 for details), with an emphasis on distinctions between empirical Bayes and full Bayes. Then a fully Bayesian framework is used to formulate Kelley's regressed score and Wainer et al.'s (2001) multivariate regressed score methods.

2.2.1 Overview of Bayesian Inference

The centerpiece of the Bayesian framework is Bayes' theorem, which was first discussed by Thomas Bayes in 1763 and was thereafter named after him. Bayes' theorem is often portrayed in terms of the probabilities of discrete events. To generalize further, it can be written with respect to continuous probability density functions as

    p(θ | x, η) = f(x | θ) π(θ | η) / m(x | η),    (2.7)

where x denotes the observed data and θ the unknown parameters of interest. Equation (2.7) contains three essential elements of Bayes' theorem: the prior probability distribution π(θ | η), the likelihood function f(x | θ), and the posterior distribution p(θ | x, η). The quantity in the denominator, m(x | η), denotes the marginal distribution of x and is often referred to as a normalizing constant, as its value generally makes p(θ | x, η) a proper density. The prior distribution π(θ | η) summarizes what is known about θ before any data are collected. The likelihood function f(x | θ) provides the distribution of the data, x, given the parameter value θ. The posterior distribution p(θ | x, η) summarizes the information in the data, x, together with the information in the prior distribution, π(θ | η). Thus Bayes' theorem provides a formal mechanism to update the prior distribution to a posterior distribution after seeing the data. The current posterior distribution can be used as a prior distribution for the next year; hence Bayesian inference provides a natural way to represent the learning that occurs as more information is collected over time.

2.2.2 Empirical Bayes vs. Full Bayes

Carlin and Louis (2000) provided a comparison between empirical Bayes and full Bayes. If η in Equation (2.7) is unknown, then information about it is captured by the marginal distribution m(x | η) in the denominator. An empirical Bayes analysis uses this marginal distribution to estimate η by η̂ = η̂(x) [e.g., the marginal maximum likelihood estimator (MLE)] and then uses p(θ | x, η̂) as the posterior distribution. In contrast, a fully Bayesian analysis augments Equation (2.7) with a hyperprior distribution, h(η | λ), and computes the posterior distribution as

    p(θ | x, λ) = ∫ f(x | θ) π(θ | η) h(η | λ) dη / ∫∫ f(x | θ) π(θ | η) h(η | λ) dθ dη
                = ∫ p(θ | x, η) h(η | x, λ) dη.    (2.8)

The second representation shows that the posterior is a mixture of the posterior in Equation (2.7) and the hyperprior updated by the data x.
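The "data used twice" point made below can be seen numerically in the simplest normal hierarchy. The following sketch compares the two approaches under assumed settings (the model x_i | θ ~ N(θ, 1), θ | η ~ N(η, 1), η ~ N(0, 100), and all numbers, are my illustrative choices; the conjugate algebra matches the formulas developed later in this chapter):

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.normal(loc=1.5, size=20)   # data: x_i | theta ~ N(theta, 1)
    n, xbar = len(x), x.mean()

    # Empirical Bayes: estimate the hyperparameter eta by its marginal MLE
    # (marginally x_i ~ N(eta, 2), so the MLE is xbar), then treat it as known.
    eta_hat = xbar
    eb_mean = (n * xbar + eta_hat) / (n + 1.0)
    eb_var = 1.0 / (n + 1.0)

    # Full Bayes: integrate eta out under the hyperprior eta ~ N(0, 100),
    # so that marginally theta ~ N(0, 101) and the uncertainty in eta is kept.
    fb_var = 1.0 / (n + 1.0 / 101.0)
    fb_mean = n * xbar * fb_var

    print(eb_mean, eb_var)   # smaller posterior variance: the data were used twice
    print(fb_mean, fb_var)   # slightly larger, more honest posterior variance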

As is evident from the mathematical illustration above, empirical Bayes uses the data to help determine the prior through estimation of the hyperparameter η, whereas full Bayes places a distribution h(η | λ) on η to characterize the uncertainty associated with not knowing η. Consequently, empirical Bayes uses the data twice (first to help determine the prior, then again when computing the posterior), and thus leads to overestimation of the precision of the parameters. Carlin and Louis (2000) argued that empirical Bayes methods were often thought of as approximations to overly cumbersome fully Bayesian analyses. However, with the widespread availability of MCMC tools, the need for such approximations has more or less vanished. Another problem with empirical Bayes is that a point estimate of η is used to compute the posterior distribution. As Gelman et al. (2003) pointed out, point estimates are arbitrary, and using any point estimate necessarily ignores posterior uncertainty. Full Bayes handles this problem by using a distribution h(η | λ) rather than a point estimate η̂. Based on these two arguments, full Bayes is considered more appropriate than empirical Bayes; accordingly, it is worthwhile to consider a reformulation of Kelley's method and Wainer et al.'s (2001) method in a fully Bayesian framework.

2.2.3 Fully Bayesian Formulation of Kelley's Method

Mislevy (1993) has provided a derivation of Kelley's regressed score equation within a fully Bayesian framework. In the classical test theory (CTT) model, a person's observed score x_p is conceptualized as the sum of his true score τ_p and error e_p:

    x_p = τ_p + e_p,

where it is often assumed that τ_p = E(x_p), e_p ~ N(0, σ²_e), and ρ(τ_p, e_p) = 0. Equivalently, the conditional distribution of x_p given τ_p can be written as x_p | τ_p ~ N(τ_p, σ²_e).

The likelihood function for τ_p induced by an observation of x_p, or λ(τ_p | x_p), can be expressed as

    λ(τ_p | x_p) = N(x_p, σ²_e),    (2.9)

which represents our degree of belief about the relative likelihood of various values of τ_p given that the test has been administered and a score x_p obtained. To enact the fully Bayesian approach, we need to express our uncertainty about the unknown parameter τ_p. In Mislevy's derivation, it is assumed that

    τ_p ~ N(µ_τ, σ²_τ),    (2.10)

which represents our degree of belief about the value of τ_p before administration of the test; therefore, in Bayesian terminology, it is a prior distribution. Note that it draws only upon the person's membership in the population. Suppose µ_τ, σ²_τ, and σ²_e are known. The posterior distribution can be obtained by combining the prior and the likelihood as represented by Equations (2.9) and (2.10). Since the normal prior is conjugate for the normal likelihood, the resulting posterior is also normal, which can be expressed as

    τ_p | x_p, µ_τ, σ²_τ, σ²_e ~ N(τ̂_p, σ̂²_τ_p),    (2.11)

with posterior mean

    τ̂_p = ρ x_p + (1 − ρ) µ_τ,    (2.12)

and posterior variance

    σ̂²_τ_p = ( (σ²_τ)⁻¹ + (σ²_e)⁻¹ )⁻¹ = (1 − ρ) σ²_τ,    (2.13)

where ρ is the reliability coefficient, defined by ρ = σ²_τ / (σ²_τ + σ²_e). The posterior mean is a linear combination of the prior mean and the data; the posterior precision is the sum of the prior precision and the precision in the data. Therefore, the posterior distribution represents our degree of belief about the value of τ_p after we know the person's membership in the population and his test score x_p. Equation (2.12) is essentially identical to Equation (2.2), which is Kelley's regressed score formula.

To this point, the results are still equivalent under the fully Bayesian and empirical Bayes approaches. In reality, however, µ_τ, σ²_τ, and σ²_e are unknown. These two approaches differ in how they handle unknown parameters. Empirical Bayes substitutes sample estimates for population parameters. The fully Bayesian approach places a prior distribution on the unknown parameters, which preserves the uncertainty associated with them. Conjugate priors are usually used for algebraic convenience. In this case, a normal prior is put on µ_τ, and gamma priors on (σ²_τ)⁻¹ and (σ²_e)⁻¹, respectively.
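The posterior in Equations (2.11) through (2.13) is easy to verify numerically. The sketch below assumes, as the derivation does at this point, that µ_τ, σ²_τ, and σ²_e are known; the numeric inputs are arbitrary:

    def kelley_posterior(x_p, mu_tau, var_tau, var_e):
        # Normal-normal conjugate update for a single examinee.
        rho = var_tau / (var_tau + var_e)                # reliability
        post_mean = rho * x_p + (1.0 - rho) * mu_tau     # Equation (2.12)
        post_var = 1.0 / (1.0 / var_tau + 1.0 / var_e)   # Equation (2.13)
        assert abs(post_var - (1.0 - rho) * var_tau) < 1e-12  # the two forms agree
        return post_mean, post_var

    print(kelley_posterior(x_p=12.0, mu_tau=10.0, var_tau=4.0, var_e=2.0))
    # (11.33..., 1.33...): with rho = 2/3, the mean is pulled one third of
    # the way back toward the prior mean of 10.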

2.2.4 Fully Bayesian Formulation of Wainer et al.'s Method

Shin (2004) re-parameterized Wainer et al.'s (2001) multivariate regressed score method in a fully Bayesian framework, which can be considered a multivariate version of the Bayesian formulation of Kelley's regressed score method (Mislevy, 1993). Conceptually, such an extension is quite straightforward, but mathematically it is fairly challenging. For the sake of simplicity, an example with two sub-scores is used for illustration purposes. Suppose the two subtests are called M and V. In the CTT model,

    x_pv = τ_pv + e_pv, and
    x_pm = τ_pm + e_pm,

where τ_pv = E(x_pv), τ_pm = E(x_pm), ρ(τ_pv, e_pv) = 0, ρ(τ_pm, e_pm) = 0, e_pv ~ N(0, σ²_ev), and e_pm ~ N(0, σ²_em). Analogously, the likelihood functions for τ_pv and τ_pm induced by observations of x_pv and x_pm can be expressed by

    λ(τ_pv | x_pv) = N(x_pv, σ²_ev),    (2.14)

and

    λ(τ_pm | x_pm) = N(x_pm, σ²_em).    (2.15)

The prior for τ_pv and τ_pm is

    (τ_pv, τ_pm)′ ~ MVN( (µ_v, µ_m)′, [ σ²_τv  σ_τvm ; σ_τmv  σ²_τm ] ),    (2.16)

which represents our belief about the values of τ_pv and τ_pm as well as their relationship. Again, suppose µ_v, µ_m, σ²_τv, σ²_τm, σ_τvm, σ²_ev, and σ²_em are known.

The posterior distribution can be obtained by combining the prior and the likelihood as represented by Equations (2.14) through (2.16). After some calculation, it is found that

    τ_pv | x_pm, x_pv ~ N( µ_v + Σ_31 Σ_11⁻¹ (x_pm − µ_m, x_pv − µ_v)′,  σ²_τv − Σ_31 Σ_11⁻¹ Σ_31′ ),    (2.17)

and

    τ_pm | x_pm, x_pv ~ N( µ_m + Σ_21 Σ_11⁻¹ (x_pm − µ_m, x_pv − µ_v)′,  σ²_τm − Σ_21 Σ_11⁻¹ Σ_21′ ),    (2.18)

where

    Σ_11 = [ σ²_τm + σ²_em   σ_τmv ; σ_τvm   σ²_τv + σ²_ev ],
    Σ_21 = [ σ²_τm   σ_τmv ], and
    Σ_31 = [ σ_τmv   σ²_τv ].

When Equations (2.17) and (2.18) are compared with Equation (2.4), it is evident that the multivariate fully Bayesian method yields results equivalent to those of Wainer et al.'s (2001) multivariate empirical Bayes method.
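Equations (2.17) and (2.18) can be evaluated with a few lines of linear algebra, exploiting the fact that the observed-score covariance matrix Σ_11 is the true-score covariance matrix plus the diagonal matrix of error variances. The sketch below is my own illustrative code (it orders the two sub-scores as (v, m), and all numbers are arbitrary), computed with the hyperparameters treated as known:

    import numpy as np

    def bivariate_posterior(x, mu, Sigma_tau, err_var):
        # x, mu: length-2 arrays of observed sub-scores and prior means (v, m).
        # Sigma_tau: 2x2 true-score covariance; err_var: (var_ev, var_em).
        x, mu = np.asarray(x, float), np.asarray(mu, float)
        Sigma_11 = Sigma_tau + np.diag(err_var)   # covariance of the observed scores
        W = Sigma_tau @ np.linalg.inv(Sigma_11)   # rows carry the Sigma_21/Sigma_31 weights
        post_mean = mu + W @ (x - mu)             # means of Equations (2.17)-(2.18)
        post_cov = Sigma_tau - W @ Sigma_tau      # posterior variances on the diagonal
        return post_mean, np.diag(post_cov)

    Sigma_tau = np.array([[4.0, 2.4],
                          [2.4, 4.0]])
    print(bivariate_posterior(x=[12.0, 9.0], mu=[10.0, 10.0],
                              Sigma_tau=Sigma_tau, err_var=[2.0, 2.0]))

Because W has nonzero off-diagonal elements whenever σ_τvm ≠ 0, each posterior mean borrows from the other observed sub-score, exactly as in Wainer et al.'s method.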

In reality, however, µ_v, µ_m, σ²_τv, σ²_τm, σ_τvm, σ²_ev, and σ²_em are unknown. To implement the fully Bayesian approach, priors for these parameters need to be specified. Compared to the fully Bayesian specification of Kelley's univariate regressed score method, the prior specification for the multivariate model presents additional complexity. Conventionally, the following semi-conjugate priors are used:

    (µ_v, µ_m)′ ~ MVN( (µ_0v, µ_0m)′, [ σ²_0v  σ_0vm ; σ_0mv  σ²_0m ] ),    (2.19)
    (σ²_ev)⁻¹ ~ Gamma(α_v, β_v),    (2.20)
    (σ²_em)⁻¹ ~ Gamma(α_m, β_m),    (2.21)

and

    [ σ²_τv  σ_τvm ; σ_τmv  σ²_τm ]⁻¹ ~ Wishart(R, ν).    (2.22)

The Wishart is a multivariate analogue of the gamma. The parameter R may be thought of as an assessment of the covariation in the data, X′X, and ν may be thought of as degrees of freedom. According to Spiegelhalter, Thomas, Best, and Gilks (1996), specifying the distribution as Wishart(R, ν) implies that one's best estimate for the covariance matrix is R/ν, and that this estimate is based on ν observations. In practice, if a covariance matrix Σ_prior from prior data is available, R is usually specified as the product of ν and Σ_prior. What is noteworthy is that by using these hyperpriors, the fully Bayesian regressed score method avoids using point estimates of x̄ and S_true.
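The R = ν × Σ_prior convention can be checked numerically. The sketch below is an assumption-laden illustration rather than the WinBUGS code used in this study: scipy parameterizes the Wishart by a scale matrix with E[W] = df × scale, so a BUGS-style Wishart(R, ν) prior on a precision matrix corresponds to scale = R⁻¹ (this mapping, and all the numbers, should be checked against one's own sampler):

    import numpy as np
    from scipy.stats import wishart

    # A covariance matrix from prior (e.g., previous-year) data, and the
    # number of observations' worth of weight we want that prior to carry.
    Sigma_prior = np.array([[1.0, 0.6],
                            [0.6, 1.0]])
    nu = 10
    R = nu * Sigma_prior   # so that the implied best guess R / nu equals Sigma_prior

    # A BUGS-style Wishart(R, nu) prior on the precision matrix has mean
    # nu * R^{-1}, which here equals Sigma_prior^{-1}.
    draws = wishart.rvs(df=nu, scale=np.linalg.inv(R), size=20000, random_state=1)

    mean_precision = draws.mean(axis=0)    # approximately Sigma_prior^{-1}
    print(np.linalg.inv(mean_precision))   # approximately Sigma_prior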

2.3 Comparison of Wainer et al.'s Multivariate Empirical Bayes and Shin's Multivariate Fully Bayesian Methods

Shin (2004) first developed the multivariate fully Bayesian method. To demonstrate its estimation accuracy in comparison with a known and established scoring procedure, Shin (2004) also compared this method to its empirical Bayes counterpart, the regular Wainer et al. (2001) method. A simulation study was designed in which three factors were manipulated: the number of items in a subtest, the number of subtests, and the correlation between subtests. The two methods were evaluated across all the conditions, and it was found that their performance was very similar in terms of absolute bias, standard error, and root mean square error.

Despite the similar performance of the two methods, Shin (2008) pointed out that they differ mainly in three ways. First, Wainer et al.'s (2001) empirical Bayes method lacks the capability to compute the conditional standard error of measurement, whereas Shin's fully Bayesian method can produce a standard error of measurement for each individual. Second, Shin's method involves the use of MCMC and thus is more computationally complex than Wainer et al.'s method. Third, as more prior information becomes available, Wainer et al.'s method has no mechanism for improving the estimation by incorporating it, whereas Shin's method can incorporate prior information by using Bayesian sequential analysis. Shin (2008) also pointed out that the similar performance of the two methods shown in Shin (2004) was expected, given that only noninformative priors were used for Shin's fully Bayesian method; improvement should be expected when the full potential of the fully Bayesian approach is tapped. In addition to the three differences mentioned in Shin (2008), the methods also differ in the output that can be generated: Wainer et al.'s empirical Bayes method generates only a point estimate for each individual and an overall standard error of measurement, whereas Shin's fully Bayesian method generates a posterior distribution, which provides a fuller picture of the parameter of interest and allows for more complex inferences. Additionally, the methods differ in how they handle unknown parameters: Wainer et al.'s method ignores the uncertainty associated with unknown parameters and uses sample point estimates for estimation, whereas Shin's method acknowledges that uncertainty by putting prior distributions on the unknown parameters.

2.4 Identifiability Issues for the Fully Bayesian Methods

A potential pitfall that comes along with the ease and convenience of Bayesian modeling is the temptation to fit models larger than the data can readily support (Eberly and Carlin, 2000). This is evidently the case with the fully Bayesian variants of both Kelley's and Wainer et al.'s (2001) methods. Specifically, both methods are based on score rather than item level data. Score level data cannot provide information about σ²_τ and σ²_e individually, but only about their sum, which is equal to the observed score variance, σ²_x. As such, σ²_τ and σ²_e are not individually identifiable. This is an issue of identifiability, which a Bayesian might refer to as likelihood identifiability (Eberly and Carlin, 2000).

Posterior impropriety resulting from a likelihood-unidentified parameter having an improper prior will cause convergence failure in the MCMC context. To resolve the unidentifiability problem, additional information outside the data is required. For a frequentist, the only solution is to add constraints on σ²_τ and σ²_e to make them individually identifiable, whereas a Bayesian can choose whether to add such constraints or to place proper prior distributions on the unidentified parameters. Fortunately, research on this topic to date (e.g., Eberly and Carlin, 2000; Lindley, 1971; Rannala, 2002; Xie and Carlin, 2006) has shown that identifiability is not a real problem for Bayesian analyses, since the issue can always be resolved via proper priors. In the Bayesian paradigm, proper prior distributions necessarily lead to proper posteriors; hence, every parameter can be estimated. The key is to choose appropriate priors that will produce acceptable MCMC convergence behavior. Eberly and Carlin (2000) investigated the effects of using different priors on convergence using a simple example of an unidentified model:

$y \mid \theta, \phi \sim N(\theta + \phi, 1),$

where N(0, σ₁²) and N(0, σ₂²) are priors for θ and φ, respectively. Evidently, the single data point y cannot possibly provide information to separate θ and φ; only their sum is identifiable. In the study, σ₁² was set to infinity, which was equivalent to assuming a flat prior on θ, while values of σ₂² were manipulated to reflect three levels of prior on φ: flat, tight, and moderate. Results revealed that when an essentially flat prior was used on φ, both θ and φ failed to converge, whereas when an extremely tight prior was used on φ, both θ and φ converged immediately. An interesting finding was that when a moderately informative prior was used, the convergence behavior depended on the starting value for φ. Specifically, when the starting value for φ was not a plausible draw from the true stationary distribution of φ, results suggested a failure of convergence for θ and φ, as well as a slow convergence of their sum, which was an identified parameter in the model. Therefore, Eberly and Carlin (2000) provided a recommendation for Bayesian practitioners, which was to play it safe and use tightly informative priors for parameters not in the identified subset. After all, this is the Bayesian equivalent of the usual frequentist approach to this problem, namely, imposing constraints on the parameter space. To be more specific, placing an extremely tight prior on φ is equivalent to setting φ to the mean of its prior distribution.
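In WinBUGS-style code, this toy model might be written as follows (a sketch for illustration only, using WinBUGS's precision parameterization of the normal; the node names are mine):

   model {
      y ~ dnorm(m, 1)            # a single observation with known variance 1
      m <- theta + phi           # only the sum is identified by the likelihood
      theta ~ dnorm(0, 1.0E-6)   # essentially flat prior on theta
      phi ~ dnorm(0, 1.0E+6)     # extremely tight prior on phi restores convergence
   }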

2.5 Other Sub-score Augmentation Methods

The previous sections included a treatment of both Wainer et al.'s (2001) multivariate empirical Bayes method and Shin's (2004) multivariate fully Bayesian method. Connections between these two methods have been established through both Bayesian derivation and simulation research. Shin's fully Bayesian method has been shown to be theoretically superior to Wainer et al.'s empirical Bayes method, although they perform similarly when non-informative priors are used. It is also important to acknowledge the existence of other sub-score augmentation methods in the literature. Apart from Wainer et al.'s and Shin's methods, some important methods include Yen's OPI (1987), Brennan's multivariate generalizability theory approach (2001), and the MIRT approach (e.g., Tate, 2004; Thissen and Edwards, 2005; Walker and Beretvas, 2003; Wang, Chen, and Cheng, 2004; Yao and Boughton, 2007; Zhu and Stone, 2008). In this section, these methods are briefly described, and their advantages and disadvantages are analyzed with reference to Wainer et al.'s method. Little explicit reference is made to Shin's method, given the similarities between these two methods as well as the limited research on that method.

Yen's OPI (1987) is a procedure that combines performance on a particular subtest with information from the examinee's overall test performance. This goal is accomplished by combining an empirical Bayes approach with an IRT domain score estimation method (Bock, Thissen, and Zimowski, 1997).

Yen (1987) deals only with multiple-choice (MC) items, and Yen et al. (1997) extends the method to items of mixed types. For ease of illustration, only the method for MC items is presented here. The OPI for subtest j, T_j, can be expressed by Equation (2.23):

$T_j = w_j \frac{x_j}{n_j} + (1 - w_j)\,\hat{T}_j$,  (2.23)

where x_j/n_j is the observed proportion-correct score for subtest j, T̂_j is the estimated proportion-correct score for subtest j given θ̂ based on the total test using the IRT domain score method ($\hat{T}_j = \frac{1}{n_j}\sum_{i=1}^{n_j} P_{ij}(\hat{\theta})$), and w_j is the relative weight given to the observed proportion-correct score. From a Bayesian perspective, T̂_j is the prior, containing information from the total test, which is usually more reliable; x_j/n_j is the data, containing information only from observed scores on subtest j; w_j and 1 − w_j represent the weights assigned to the data and the prior, respectively; and T_j is the posterior mean resulting from combining the prior with the data. Yen's OPI is empirical Bayes because both T̂_j and w_j are estimated from the data.
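As a hypothetical numerical illustration (the values here are invented for exposition, not taken from any of the studies discussed), consider a 10-item subtest with 7 items answered correctly, an IRT-based estimate T̂_j = 0.60, and weight w_j = 0.4:

$T_j = 0.4 \times \frac{7}{10} + (1 - 0.4) \times 0.60 = 0.28 + 0.36 = 0.64.$

The reported OPI is thus pulled from the observed proportion correct of 0.70 toward the total-test-based estimate of 0.60, with the amount of shrinkage governed by w_j.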

When Equation (2.23) is compared to the corresponding equations given earlier for Kelley's and Wainer et al.'s (2001) methods, it is not too hard to discover how Yen's OPI is related to those methods and how they differ. All three methods adopt an empirical Bayes approach, augmenting observed sub-scores with prior information. What makes them different is the prior information they use. Kelley's method only uses the group mean as the prior, whereas Yen's OPI method uses the total test score as the prior; instead of treating the rest of the test as a unit as Yen's OPI method does, Wainer et al.'s method distinguishes among the other sub-scores in the estimation of any one of them, and thus the prior used is a vector of other sub-scores. Consequently, as compared to Wainer et al.'s method, Yen's OPI method is less suited to the estimation of sub-scores when the test is truly multidimensional. This relationship has been confirmed by studies using both simulated data (Shin, 2004) and empirical data (Dwyer et al., 2006). Therefore, it is not considered in this study.

Brennan (2001) has proposed a method of estimating sub-scores from a multivariate generalizability approach. It is a multivariate extension of Kelley's regressed score method to models of the type used in generalizability theory. Or, equivalently, it is an extension of Wainer et al.'s (2001) method in the framework of generalizability theory. The distinction between Wainer et al.'s and Brennan's methods boils down to that between classical test theory and generalizability theory. Specifically, classical test theory uses an undifferentiated error term and assumes errors to be uncorrelated, whereas generalizability theory explicitly incorporates multiple sources of error and accounts for correlated error. Generalizability theory is advantageous either when there are multiple sources of error (e.g., items, raters, occasions) or when correlated error is present. In the context of sub-score estimation, it is the latter that is of more concern. Correlated error becomes an issue when items are crossed with the fixed categories. An example is evaluating essays analytically according to a rubric (e.g., development, focus, organization, language use, and grammar), in which the five evaluative criteria are the fixed categories and essays are crossed with them. In this case, Brennan's method is more appropriate than Wainer et al.'s method. For most state assessments, however, it is of most interest to estimate sub-scores based on content categories. In this case, items are usually nested within content categories, and correlated error becomes less of an issue. Given that the present study is mainly focused on sub-score estimation in the context of state assessments, Wainer et al.'s method is a nice substitute for Brennan's method due to its prevalence and ease of use; therefore, Brennan's method was not considered, but it is strongly recommended for future research.

MIRT is a growing methodology for modeling the relationships of examinees to a set of test items using a matrix of responses. It holds enormous promise for sub-score augmentation because it improves estimation precision by explicitly modeling the relationships among sub-scores and jointly estimating all the model parameters ξ, µ, and Σ (Wang, Chen, and Cheng, 2004). Only if the variance-covariance matrix Σ is a diagonal matrix, which means that the latent traits are independent, will the measurement efficiency of the multidimensional approach be equivalent to that of the unidimensional approach. This improvement in precision has been reported in many studies in which multidimensional and unidimensional IRT approaches were compared (Tate, 2004; Thissen and Edwards, 2005; Walker and Beretvas, 2003; Wang, Chen, and Cheng, 2004; Yao and Boughton, 2007; Zhu and Stone, 2008). As Thissen and Edwards (2005) pointed out, when IRT θ estimates are used in Wainer et al.'s (2001) method, it can be considered a special case of a MIRT model. The special case involves at least the following constraints or compromises: (1) there are as many dimensions as there are subtests; (2) each item can load on only one subtest (simple structure); and (3) item parameters are estimated based on a unidimensional IRT model and then are used to estimate θ_j for each subtest. As compared to Wainer et al.'s method, MIRT is advantageous especially when the test is constructed with IRT models and shows some multidimensionality. However, MIRT has limited applicability due to a lack of MIRT estimation programs, especially ones that can handle tests with mixtures of dichotomous and polytomous items (Yao and Boughton, 2007). Recently, a Bayesian multivariate item response theory (BMIRT; Yao, 2003) program has been developed, but it has not been made available for commercial use yet. Another drawback that limits MIRT from widespread application is its intensive computational demand (Zhu and Stone, 2008).

Consequently, it often takes a long time to produce score estimates, and it is thus not very feasible in the context of state assessments, which usually require fast reporting of scores. Given these concerns, MIRT was not considered in the study, but it is highly recommended for future use when these practical constraints can be overcome.

Although the three methods described above are not investigated in this study, it is helpful to know how they perform as compared to Wainer et al.'s (2001) and Shin's (2004) methods. Unfortunately, no empirical study has been done to examine Brennan's method yet; therefore, only comparisons of Yen's OPI and MIRT methods with Wainer et al.'s and Shin's methods are discussed here. Shin (2004) compared Yen's OPI, Wainer et al.'s, and Shin's methods under various simulation conditions. As expected, Wainer et al.'s and Shin's methods outperformed Yen's OPI method in producing more precise sub-score estimates, and this difference decreased as the correlation between subscales increased. This finding is a direct result of Yen's method treating the rest of the test as a unit, whereas Wainer et al.'s and Shin's methods distinguish among all the other subscales. When the test is essentially unidimensional, these three methods tend to produce similar results. Dwyer, Boughton, Yao, Lewis, and Steffen (2006) compared Yen's, Wainer et al.'s, and MIRT methods with empirical data in a unidimensional scenario and a multidimensional scenario. Wainer et al.'s and MIRT methods clearly outperformed Yen's OPI method in both scenarios. Wainer et al.'s method produced more accurate results than MIRT in the unidimensional scenario, but the situation reversed in the multidimensional scenario. Zhu and Stone (2008) also compared these three methods with empirical data and found that augmented scores produced by these methods were highly correlated, with Wainer et al.'s method more highly correlated with MIRT than with Yen's OPI method.

2.6 Studies on Wainer et al.'s Method and Shin's Method

The last decade has seen a number of studies into the performance of Wainer et al.'s method with both simulated data and real empirical data (Dwyer et al., 2006; Edwards and Vevea, 2006; Shin, 2004; Wainer et al., 2000; Wainer et al., 2001; Zhu and Stone, 2008). There are also studies into Shin's method (Shin, 2004; Shin, 2008). This section focuses on research findings on the superior performance of augmented scores as compared to non-augmented scores, as well as some problematic issues of which people need to be aware. Although it flows naturally from Bayesian theory that, with an appropriate weighting scheme, the empirical Bayes estimates will outperform scores that have not been adjusted, the extent and practical significance of this benefit need to be determined. Edwards and Vevea (2006) conducted a simulation study to examine the effects of Wainer et al.'s (2001) augmentation method by comparing augmented sub-scores with non-augmented sub-scores in a variety of realistic conditions. Simulation conditions included the number of subscales, the length (hence, reliability) of subscales, and the underlying correlations between subscales. Evaluative criteria included root mean square error (RMSE), reliability, the percentage correctly identified as falling within specific proficiency ranges, and the percentage of simulated individuals for whom the augmented score was closer to the true score than was the non-augmented score. As expected, augmented sub-scores exhibited smaller RMSE, had higher reliability, and correctly placed a higher percentage of simulees in appropriate ability groups. As compared to non-augmented sub-scores, augmented sub-scores were also closer to true sub-scores for a higher percentage of simulees. This improvement was a function of the correlation between subscales and subscale length (reliability). The largest gains were seen in cases where the correlations between subscales were high, the reliability of the subscale being augmented was low, and the reliability of the subscale providing collateral information was high.

For example, when the subscale being augmented was small (5 items), the subscale providing collateral information was large (40 items), and the correlation between the subscales was high (r = 0.9), the use of augmentation on IRT scale scores resulted in a reduction in RMSE of 0.25, from 0.76 to 0.51, an increase in reliability from about 0.4 to 0.74, 13% more simulees correctly classified into ability groups, and 16% more simulees with sub-scores more accurately estimated. In a case that closely resembled a state assessment, say, a 10-item subscale being augmented by a 40-item subscale with a correlation of 0.9, the use of augmentation on IRT scale scores resulted in a reduction in RMSE of 0.16, an increase in reliability from 0.59 to 0.89, and 13% more simulees with sub-scores more accurately estimated. Even in the worst case, a 40-item subscale being augmented by a 5-item subscale with a correlation of 0.3, the effect of augmentation was almost negligible, but there was still some improvement; at least it did no harm.

Studies using empirical data have also shown that the augmented sub-scores yielded by Wainer et al.'s (2001) method were usually more reliable than non-augmented sub-scores (Dwyer et al., 2006; Shin, 2004; Wainer et al., 2000; Wainer et al., 2001; Zhu and Stone, 2008). However, this gain in precision came along with a problem; some researchers found that for many assessments augmented sub-scores were highly correlated, and thus the sub-scores did not seem to serve any useful diagnostic purposes (Wainer et al., 2001; Zhu and Stone, 2008). This has triggered a series of research studies into the value of sub-scores (Haberman, 2005; Haberman et al., 2006; Sinharay et al., 2007). The basic rationale underlying these studies is that a sub-score is deemed to have value if it provides a more accurate measure of the construct it measures than is provided by the total score from the larger test. The proposed criterion is mean squared error (MSE). The basic strategy is that both the observed sub-score and the total score are used to approximate the unobserved true sub-score; if the MSE is smaller when the observed sub-score is used to approximate the true sub-score than when the total score is used, then the observed sub-score is a more accurate measure of the construct measured by the true sub-score than the total score, and thus has additional value over the total score.

Haberman (2005) examined the SAT I math and verbal examinations and the Praxis examinations titled Fundamental Subjects: Content Knowledge. The SAT I math and SAT I verbal examinations have sub-scores that are usually highly correlated, whereas the Praxis examinations under investigation measure very distinct content areas. Not surprisingly, when MSE was used as the criterion, the results supported reporting of sub-scores for the Praxis examinations but not for the SAT I math and verbal examinations. Haberman (2005) concluded that, with the criterion of MSE of approximation of true sub-scores, observed sub-scores were most likely to have value if they had relatively high reliability by themselves and if the true sub-score and true total score had only a moderate correlation. Haberman et al. (2006) and Sinharay et al. (2007) verified the importance of both of these conditions when they applied this procedure to other assessment data.
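In rough terms, the comparison can be sketched as follows (a simplified classical-test-theory rendering of the logic, not Haberman's exact derivation). Approximating the true sub-score τ_s by its linear regression on the observed sub-score s, versus on the total score x, yields

$\mathrm{MSE}(s) = \sigma^2_{\tau_s}\left[1 - \rho^2(s, \tau_s)\right], \qquad \mathrm{MSE}(x) = \sigma^2_{\tau_s}\left[1 - \rho^2(x, \tau_s)\right],$

where ρ²(s, τ_s) is the sub-score reliability. The observed sub-score then has added value whenever ρ²(s, τ_s) > ρ²(x, τ_s), which is most likely when the sub-score is reliable and the true sub-score is only moderately correlated with the true total score, precisely the two conditions identified above.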

2.7 Do Sub-scores Really Have Little Value?

For most state assessments, sub-scores usually have moderate reliability and are often highly correlated. Neither of the conditions mentioned in Haberman (2005) is likely to hold. Does this mean that sub-scores have little value and thus are not worth reporting for most state assessments? In this section, I discuss this claim by raising three questions.

First, is the value under previous investigation the same value for which teachers and students are looking? Tate (2004) made a distinction between two types of decisions based on sub-scores. One is an inter-individual decision, which considers one sub-score at a time and compares individuals based on that sub-score; the other is an intra-individual decision, which considers one individual at a time and compares sub-scores within that individual. The former is of interest when sub-scores are used separately to compare individuals (e.g., to rank individuals based on sub-scores, or to compare sub-scores to certain cut scores); the latter is of interest when sub-scores are used jointly to examine score profiles (e.g., to compare sub-scores within an individual and detect strengths and weaknesses). The criterion of MSE of approximation of true sub-scores as proposed in Haberman's (2005) study is concerned primarily with inter-individual decision-making. In the context of state assessments, however, teachers and students mainly use sub-scores to help detect strengths and weaknesses, and thus are primarily interested in the value for relative (intra-individual) decision-making. Therefore, the value investigated in previous studies (Haberman, 2005; Haberman et al., 2006; Sinharay et al., 2007) is not exactly the same value for which teachers and students are looking; consequently, their conclusion concerning the value of sub-scores actually refers to the value for absolute (inter-individual) decisions, but not necessarily the value for relative (intra-individual) decisions. The existence of substantial research on profile analysis lends evidence to the value of sub-scores for detecting strengths and weaknesses. A review of the literature on profile analysis is beyond the scope of the study; interested readers are referred to Watkins et al. (2005).

The second question concerns the augmentation method itself. Do the current sub-score augmentation methods have any flaws? Wainer et al. (2000) acknowledged that their method could be susceptible to a phenomenon called Simpson's paradox; that is, in all situations in which results are presented in the aggregate, there is always the chance that with some kind of disaggregation very different results might occur. To be more specific, Wainer et al.'s method assumes that correlations between sub-scores are the same for every subpopulation; when sub-score correlations are not invariant across subpopulations, using the same correlations for all subpopulations could potentially lead to unfair adjustment of sub-scores. Two concrete examples are as follows. For a recent immigrant population, verbal and math scores are likely to be uncorrelated, so it would not be appropriate to use verbal scores to adjust math scores or vice versa. Similarly, for students studying in a school where a certain content strand (say, probability) is not taught, or at least not taught to the same extent as in most other schools in the state, the correlations between that strand and other strands may be near zero or very low; in this case, it would be inappropriate to augment the sub-score of that content strand for students in that school in the same fashion (using the same correlation matrix) as for students in other schools.

Mann and Moulder (2008) investigated the potential bias with a simulation study, in which datasets that exhibited group differences in the correlations between sub-scores were simulated. The results confirmed the suspicion raised in Wainer et al.'s (2000) study, concluding that bias could occur if analyses were conducted with all groups combined when subtest trait correlations were not invariant across all groups. It is important to know that this problem is not just limited to Wainer et al.'s method; it applies to all methods that use correlations to enhance sub-scores, which is the case for all the existing sub-score augmentation methods.

As mentioned previously, augmented sub-scores are usually so highly correlated that they are nearly statistically indistinguishable (Wainer et al., 2001; Zhu and Stone, 2008). Therefore, the third question is whether high correlations necessarily imply that the sub-scores measure the same construct, or at least are almost inseparable. As Wang, Chen, and Cheng (2004) argued, high correlations between constructs do not necessarily mean that they are the same construct. For instance, even though height and arm length are almost perfectly correlated, they are still distinct constructs; with sufficiently precise measures, we would be able to find that although for most people arm length is proportional to height, there still exist some people whose arm length is unusually long or short for their height. A similar case can be made for sub-scores. They are highly correlated and seem to be indistinguishable given the current measures. However, if we increase the precision of our measures, we will be able to better detect strengths and weaknesses based on sub-scores. So the key is to increase estimation precision.

2.8 Using Collateral Information to Enhance Precision

Mislevy (1987) pointed out that additional precision might be achieved by incorporating collateral information. Statistically speaking, collateral information refers to any variable that is correlated with the target variable (Wang et al., 2004). It can be test information within the test (e.g., item responses to other subtests), test information external to the test (e.g., previous test scores or other test scores), or examinee information (e.g., school attended, courses taken, grades obtained, or even family background).

In the paradigm of sub-score augmentation, the collateral information used has been either total test scores (Yen, 1987) or item responses to other subtests (Wainer et al., 2001; Yao and Boughton, 2007), which is limited to information available on the current test. Shin (2008) attempted to augment sub-scores using collateral information outside the current test. The rationale for Shin's (2008) study is that a testing program usually spans more than one year, and previous-year sub-scores can be incorporated into current-year sub-score estimation. Specifically, the previous year's posterior distribution can be used to construct priors for estimating the current year's sub-scores, which is called Bayesian sequential analysis. According to Bayes' theorem, the posterior precision is equal to the datum precision plus the prior precision (Lee, 1989). Consequently, Bayesian sequential analysis enhances estimation precision by incorporating informative prior information. Shin (2008) confirmed this theoretical benefit with a simulation study, showing that results improved each year in terms of standard error and root mean square error, with the biggest improvement occurring in the fourth year.

Outside the paradigm of sub-score augmentation, using collateral information has wide application in areas such as item response theory, where more types of collateral information have been used to enhance estimation precision. Generally speaking, two major types have been investigated in the literature: item information and examinee information. Regarding the use of item information, the literature has shown the use of a variety of collateral item information to improve item parameter estimation. For example, some early research focused on defining item difficulty parameters in terms of underlying cognitive tasks (e.g., Suppes, Jerman, and Brian, 1968; Fischer, 1973; Fischer and Formann, 1982). Mislevy (1988) proposed a model that was intended to improve item difficulty estimates by combining Bayes estimation procedures and the linear logistic test model.

The fundamental rationale was that only those items which shared the same set of relevant, salient features were considered exchangeable, and this was formally realized in the Bayesian procedure by assigning different priors to items with different combinations of relevant features. According to Bayes' theorem, the precision of the empirical Bayes estimate is the sum of the precision afforded by the likelihood and that afforded by the collateral information, showing that estimation precision increases as collateral information is incorporated into the process. Also, the significance of this effect decreases as the number of items and/or examinees increases. These theoretical results were confirmed empirically in Mislevy (1988). Among other things, Hall (2007) examined different ways of constructing collateral information to incorporate into the prior specification for IRT parameter estimation. In general, she found that incorporating readily available collateral item information produced more accurate and precise IRT parameter estimates in terms of RMSD and standard error, and such gains were consistent for all the sample sizes and sample representations under her investigation.

The incorporation of collateral information regarding examinee characteristics is grounded in the knowledge that, as far as educational assessment data are concerned, the distribution of ability may differ depending on some characteristics of examinees, such as demographics, socio-economic status, and instructional history variables. Thus, it would be unrealistic to assume that all examinees come from a single, undifferentiated population. Mislevy (1987) was the first to propose a population-differentiated model to incorporate examinee information into item parameter estimation. He compared the precision of item parameters estimated with and without differentiation by subpopulation membership, and found that using subpopulation membership as collateral information enhanced the estimation precision of difficulty parameter estimates. A number of studies have also examined the effects of incorporating other types of collateral information regarding examinees into the parameter estimation process (e.g., Mislevy and Sheehan, 1989; Beaton and Zwick, 1992; Mislevy, Beaton, Kaplan, and Sheehan, 1992; Muthen, 1989; Adams, Wilson, and Wu, 1997).

For example, Mislevy and Sheehan (1989) showed that using collateral information about persons' background variables, such as educational background or status on demographic variables, increased the estimation precision of item parameters and person abilities. Results also revealed that using collateral information was most beneficial when the relationship between the collateral variables and the latent trait was strong and the number of item responses per examinee was small. The use of collateral information regarding both items and examinees has thus been found to be beneficial in terms of estimation precision in the literature on IRT parameter estimation. The latter bears obvious relevance to the present investigation of sub-score augmentation. However, none of the aforementioned types of collateral information regarding examinees has yet been applied in the area of sub-score augmentation, although they seem to hold substantial promise. As a matter of fact, Mislevy's (1987) study has direct relevance to the present study in that school membership is also used as collateral information in sub-score estimation in this study, and students are differentiated by the schools that they attend.

2.9 Implications for the Present Study

As demonstrated through the mathematical derivation under Bayesian theory, Wainer et al.'s (2001) and Shin's (2004) methods are closely related both mathematically and conceptually. Specifically, Shin's method is a fully Bayesian variant of Wainer et al.'s method. Shin (2004) showed with a simulation study that these two methods also performed very similarly when non-informative priors were used. However, Shin's method is argued to be superior for three major reasons. First, Wainer et al.'s method is an empirical Bayes approach and consequently suffers the drawback of using the data twice, thus leading to overestimation of the precision of the parameters; Shin's method adopts a fully Bayesian approach and acknowledges the uncertainty associated with unknown parameters by putting prior distributions on them.

Second, Wainer et al.'s method only generates a point estimate for each individual and an overall standard error of measurement, whereas Shin's fully Bayesian method generates a posterior distribution, which provides a fuller picture of the parameter of interest and allows for more complex inferences. Third, as more prior information becomes available, Shin's method can incorporate this information by using Bayesian sequential analysis, but Wainer et al.'s method does not have a mechanism to do so. The three reasons cited above make Shin's method a promising approach to sub-score augmentation. However, it has not received as much attention in the literature to date as it deserves.

Shin's (2004) fully Bayesian method holds even more appeal in the context of state assessments under No Child Left Behind (NCLB). For most state assessments, sub-scores are usually highly correlated. This requires great estimation precision to differentiate between them. Shin's fully Bayesian method has the potential of enhancing sub-score estimation precision by incorporating various levels of collateral information. Specifically, collateral information regarding examinees that has been widely used in other areas of research can be applied to sub-score augmentation as well. Therefore, it seems worthwhile to tap the full potential of the fully Bayesian approach and examine its capability to enhance the meaning and precision of sub-scores in the context of state assessments.

As shown in other areas of the literature, the use of collateral information is not restricted to information regarding the test itself; it can be extended to collateral information regarding examinees. Such a broad perspective seems to suggest a need to broaden the concept of augmented sub-scores beyond borrowing information from item responses to other portions of the test. As such, other types of collateral information can very well be incorporated into sub-score estimation without using the multivariate approach. Therefore, it seems sensible to investigate univariate methods for incorporating collateral information into sub-score estimation. The fully Bayesian variant of Kelley's regressed score method seems to hold promise for responding to these new research interests.

The ways in which these methods were compared in the present study are detailed in the next chapter.

CHAPTER III. DESIGN OF STUDY

Both the fully Bayesian variant of Kelley's method and Shin's (2004) fully Bayesian method have the potential to enhance sub-score precision by incorporating collateral information. According to Bayesian theory, the more information, the narrower the posterior distribution; therefore, the use of collateral information will necessarily lead to greater estimation precision. However, the extent and practical significance of this benefit need to be determined in real assessment settings. The purpose of the present study was to use real data from state accountability assessments to determine the extent to which the use of three sources of collateral information could improve sub-score estimation precision. The three sources of collateral information under investigation included: (1) information from other sub-scores, (2) schools that students attended, and (3) school-level scores on the same test taken by previous cohorts of students in that school. Analyses were conducted on the standardized observed score metric. Results were evaluated in light of three comparison criteria, i.e., signal/noise ratio (SNR), standard error of estimate (SEE), and sub-score separation index (SSI). In the following sections, the data and the sample used in the study are described. Then the models are specified and operationalized, followed by a detailed description of the comparison criteria. Finally, analysis procedures are laid out, followed by a summary of the study and research questions.

3.1 Data

The data used in this study came from an NCLB accountability assessment program in a southern state. This program is intended to assess public school students in grades 3, 5, 6, 7, and 9. For the sake of simplicity, only Grade 5 was under investigation in this study. The assessment program consists of four content areas: English language arts (ELA), mathematics, science, and social studies. Only the two main tests were chosen to be investigated in this study: English language arts (ELA) and mathematics.

ELA consists of four strands: reading, writing, language conventions, and using information resources (UIR). Mathematics consists of six strands: numbers and number relations (NNR); algebra; measurement; geometry; data analysis, probability, and discrete math (DPD); and patterns, relations, and functions (PRF). Both tests contain a mixture of multiple-choice (MC) items and constructed-response (CR) items. For practical reasons, the writing strand was not included in the present study. One of the reasons, elaborated in more detail in the following section, was that the two ELA forms used in the study were identical except in writing prompts. To take advantage of this clean design of two otherwise-identical forms, I chose to remove the writing strand from the present study and focus on the other three strands that share the same items across the two forms.

Tables 3.1 and 3.2 provide a summary of some important test design characteristics for ELA and mathematics, respectively. Reliabilities were estimated for the 2007 and 2006 samples using the stratified alpha coefficient for total test scores and Cronbach's alpha coefficient for sub-scores. Tables 3.3 and 3.4 provide correlations and disattenuated correlations between strand scores for ELA and mathematics, respectively. All statistics were calculated for both the 2007 and 2006 samples. As can be seen from the tables, comparisons between the two years did not appear to show much difference in reliabilities. However, there seemed to be a general trend for sub-scores to be more highly correlated in the 2007 sample than in the 2006 sample. This would have implications for assessing the effects of using previous-year data as prior information in the study. In addition, ELA and mathematics were shown to have quite different test design characteristics. ELA sub-scores seemed to have relatively high reliabilities by themselves, especially reading and language conventions, whereas the sub-score reliabilities for mathematics ranged from 0.46 to 0.69, with three of them in the low 0.50s. The disattenuated correlations for ELA sub-scores were moderate, ranging from 0.74 to 0.78, suggesting that the constructs measured by the ELA strands were correlated but distinct.

However, the disattenuated correlations for the mathematics strands were mostly in the 0.90s, suggesting that the constructs measured by the different math strands were almost indistinguishable statistically. Haberman (2007) suggested two criteria that could be used to predict sub-score value: sub-score reliability and subtest correlation. The ELA and mathematics tests seemed to differ considerably in terms of these two criteria. Therefore, the comparison between ELA and mathematics could possibly shed some light on whether and how the effects of using information from other sub-scores might differ for tests with different design characteristics. Additionally, the two tests chosen in this study were also typical of other assessments implemented at either the state or district level. For example, the tests reviewed in Wainer et al.'s (2001) study could be classified into two general categories: tests designed to have relatively reliable subtests, with each carrying a distinct piece of information, and tests composed of small subscales that are highly correlated with one another. In the present study, ELA was quite typical of the former, while math was a clear representation of the latter. Inclusion of both ELA and math for investigation was intended to help enhance the generalizability of the results to other assessments that are similar in terms of content, sub-score lengths, reliabilities, and correlations.

3.2 Sample

The sample used in this study consisted of two cohorts of students from Grade 5 who took the assessments in 2006 and 2007, respectively. The 2007 cohort was the focus of the study, while the 2006 cohort provided prior information. In 2007, half of the student population was administered Form A, and the other half Form B. In 2006, all students were administered Form A. Only the students who were administered Form A were chosen for the analysis. This was done for two main reasons: (1) no equating needed to be conducted due to the use of the same form, and (2) strands were composed of exactly the same items across the two years, thus avoiding complications such as differences in construct representation and/or item difficulties at the strand level had different forms been involved.

To reduce the sample to a manageable size, stratified cluster sampling was used to obtain the sample for the 2007 cohort. Stratification was used to ensure representation of the population in terms of ethnicity, free lunch status, and school size. School size was defined as small, medium, or large according to the number of students at the school in the 2007 cohort. Specifically, small schools were defined as those with fewer than 50 such students, medium schools as those with between 50 and 99, and large schools as those with 100 or more. Ethnicity and free lunch status were important because they are indicative of students' socio-economic status. School size was considered in the study not only because schools of different sizes may systematically differ in their curricula, resources, teacher quality, and so on, but, more importantly, because the weight given to prior test information was directly related to school size in the study. Therefore, it was considered important to investigate the extent to which the effects of using previous-year data might differ for schools of different sizes. Stratified cluster sampling was used in the sense that schools were selected using stratified sampling, and then all the students within the selected schools were included in the sample. This was done to ensure a sufficient sample size for each school. After schools were sampled for the 2007 cohort, students from the matching schools were chosen for the 2006 cohort. Tables 3.5 and 3.6 provide comparisons between the population and the sample with regard to descriptive information on the stratum variables as well as other demographic variables for 2007 and 2006, respectively. Tables 3.7 and 3.8 provide the corresponding comparisons regarding ELA and mathematics raw scores at both the total test and strand levels for 2007 and 2006, respectively. Tables 3.5 through 3.8 clearly show that the sample was representative of the population in terms of both stratum variables and test scores.

3.3 Current Score Reporting

It is important to know how scores are currently reported for this state assessment program in order to suggest ways of making improvements. For the assessment program under investigation, as required by NCLB legislation, scores are reported at both the test level and the strand level. At the test level, scale scores are reported along with a margin of error. In addition, students are classified into five performance categories in relation to state standards: advanced, mastery, basic, approaching basic, and unsatisfactory. The cut scores used for this taxonomy were determined through standard setting and expressed on the IRT theta metric of the 2006 assessment. A modified Bookmark procedure was used for standard setting. At the strand level, however, only percent-correct scores are reported, without reference to any norm reference groups or performance levels.

3.4 Model Specification and Operationalization

To examine the incremental effects of the three sources of collateral information, six models were constructed: (1) Model A: the fully Bayesian Kelley's regressed score model, a non-augmented model that estimates sub-scores based only on items within the strand of interest; (2) Model B: Model A plus schools attended; (3) Model C: Model A plus both schools attended and previous-year school-level sub-score information; (4) Model D: Shin's (2004) multivariate fully Bayesian model, which incorporates information from other strands; (5) Model E: Model D plus schools attended; and (6) Model F: Model D plus both schools attended and previous-year school-level sub-score information. All models are Bayesian models and were implemented in WinBUGS (Spiegelhalter, Thomas, Best, and Lunn, 2004), in which the Markov chain Monte Carlo method is used as the estimation method. WinBUGS code for these models is provided in Appendix A.

Model A is the baseline model that does not incorporate any of the three types of collateral information under investigation. A fully Bayesian variant of Kelley's regressed score model was chosen as the baseline model because it is a Bayesian model, and thus results could be more comparable with those from the other five models.

Model A estimates sub-scores using information from item responses only to the particular strand of interest. Models B through F augment sub-scores using collateral information. Model D is Shin's (2004) fully Bayesian model. Models A and D are described in detail in Sections 2.2.3 and 2.2.4, respectively. The remaining four models are described and operationalized as follows.

Model B

Model B is an adaptation of the fully Bayesian Kelley's model (Model A) to accommodate hierarchical data structures. This is accomplished by assuming that individual true sub-scores follow a normal distribution with a group-specific mean and a group-specific variance. Formally, it is realized by replacing Equation (2.10) with the following:

$\tau_p \sim N(\tau_g, \sigma^2_{\tau g})$,  (3.1)

where τ_g is a group-specific true score mean, and σ²_τg is a group-specific true score variance (i.e., within-group variance). In the present study, groups refer to schools. Further, the following conjugate priors are used for these two parameters:

$\tau_g \sim N(\mu_\tau, \sigma^2_\mu)$,  (3.2)

and

$(\sigma^2_{\tau g})^{-1} \sim \mathrm{Gamma}(\alpha_1, \beta_1)$,  (3.3)

where µ_τ is the overall mean, σ²_µ is the variance of the group means (i.e., between-group variance), and α₁ and β₁ are constants chosen to reflect a non-informative prior. To complete the fully Bayesian model specification, the following conjugate priors are placed on µ_τ and σ²_µ, respectively:

$\mu_\tau \sim N(\mu_0, \sigma^2_{\mu 0})$,  (3.4)

and

$(\sigma^2_\mu)^{-1} \sim \mathrm{Gamma}(\alpha_2, \beta_2)$,  (3.5)

where µ₀, σ²_µ0, α₂, and β₂ are constants chosen to represent non-informative priors.
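Putting the pieces of Model B together, a WinBUGS sketch might read as follows (an illustrative paraphrase of Equations (3.1) through (3.5) with my own node names; the code actually used in the study appears in Appendix A, and alpha and beta here denote the informative error-precision constants discussed in Section 3.5):

   model {
      for (p in 1:N) {
         x[p] ~ dnorm(tau[p], prec.e)              # observed standardized sub-score
         tau[p] ~ dnorm(tau.schl[school[p]],       # Equation (3.1)
                        prec.tau[school[p]])
      }
      for (g in 1:G) {
         tau.schl[g] ~ dnorm(mu, prec.mu)          # Equation (3.2)
         prec.tau[g] ~ dgamma(0.0001, 0.0001)      # Equation (3.3)
      }
      mu ~ dnorm(0, 1.0E-6)                        # Equation (3.4)
      prec.mu ~ dgamma(0.0001, 0.0001)             # Equation (3.5)
      prec.e ~ dgamma(alpha, beta)                 # tight prior on error precision
   }

Here school[p] indexes the school attended by person p, so each student's true sub-score is shrunk toward his or her own school's mean.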

Model C

Model C is an adaptation of Model B to incorporate school-specific informative priors from the previous year. This is accomplished by replacing Equations (3.2) and (3.3) with the following:

$\tau_g \sim N(\mu_{\tau g}, \sigma^2_\mu)$,  (3.6)

and

$(\sigma^2_{\tau g})^{-1} \sim \mathrm{Gamma}(\alpha_g, \beta_g)$,  (3.7)

where µ_τg comes from the school-specific means (τ_g) for the previous year, and α_g and β_g are based on the school-specific variances (σ²_τg) for the previous year. Further, an informative prior is placed on σ²_µ:

$(\sigma^2_\mu)^{-1} \sim \mathrm{Gamma}(\alpha_2, \beta_2)$,  (3.8)

where α₂ and β₂ are also calculated based on the between-school variance (σ²_µ) from the previous year. It should be noted that the use of the subscript g in Equations (3.6) and (3.7) allows for incorporating school-specific prior information.

Model E

Model E is an adaptation of Shin's (2004) fully Bayesian model (Model D) to accommodate hierarchical data structures. This is accomplished by assuming that individual true sub-scores follow a multivariate normal distribution with a group-specific mean vector and a group-specific variance/covariance matrix. Formally, it is realized by replacing Equation (2.16) with the following:

$\begin{pmatrix} \tau_{pv} \\ \tau_{pm} \end{pmatrix} \sim \mathrm{MVN}\!\left( \begin{pmatrix} \tau_{gv} \\ \tau_{gm} \end{pmatrix}, \begin{pmatrix} \sigma^2_{\tau gv} & \sigma_{\tau gvm} \\ \sigma_{\tau gmv} & \sigma^2_{\tau gm} \end{pmatrix} \right)$,  (3.9)

where (τ_gv, τ_gm)′ is a group-specific mean vector, and the matrix in Equation (3.9) is a group-specific variance/covariance matrix.

Further, the following semi-conjugate priors are used for these two parameters:

$\begin{pmatrix} \tau_{gv} \\ \tau_{gm} \end{pmatrix} \sim \mathrm{MVN}\!\left( \begin{pmatrix} \mu_v \\ \mu_m \end{pmatrix}, \begin{pmatrix} \sigma^2_{\mu v} & \sigma_{\mu vm} \\ \sigma_{\mu mv} & \sigma^2_{\mu m} \end{pmatrix} \right)$,  (3.10)

and

$\begin{pmatrix} \sigma^2_{\tau gv} & \sigma_{\tau gvm} \\ \sigma_{\tau gmv} & \sigma^2_{\tau gm} \end{pmatrix}^{-1} \sim \mathrm{Wishart}(R_1, \nu_1)$,  (3.11)

where (µ_v, µ_m)′ is the overall mean vector, and the matrix with elements σ²_µv, σ_µvm, σ_µmv, and σ²_µm is the variance/covariance matrix for the group means (τ_gv, τ_gm)′. Following a similar logic as explained in Section 2.2.4, R₁/ν₁ customarily represents one's best estimate for the group-specific covariance matrix; here it refers to the within-group variance/covariance. Also, that estimate is based on ν₁ observations. To complete the fully Bayesian model specification, the following semi-conjugate priors are placed on (µ_v, µ_m)′ and the between-group variance/covariance matrix, respectively:

$\begin{pmatrix} \mu_v \\ \mu_m \end{pmatrix} \sim \mathrm{MVN}\!\left( \begin{pmatrix} \mu_{0v} \\ \mu_{0m} \end{pmatrix}, \begin{pmatrix} \sigma^2_{0v} & \sigma_{0vm} \\ \sigma_{0mv} & \sigma^2_{0m} \end{pmatrix} \right)$,  (3.12)

and

$\begin{pmatrix} \sigma^2_{\mu v} & \sigma_{\mu vm} \\ \sigma_{\mu mv} & \sigma^2_{\mu m} \end{pmatrix}^{-1} \sim \mathrm{Wishart}(R_2, \nu_2)$,  (3.13)

where µ_0v, µ_0m, and the covariance matrix in Equation (3.12) are constants chosen to represent non-informative priors for (µ_v, µ_m)′, and R₂/ν₂ represents one's best estimate for the between-group variance/covariance matrix.

Model F

Model F is an adaptation of Model E to incorporate school-specific informative priors from the previous year. Different from Model C, however, the school-specific prior information includes not only school-specific means and variances but also school-specific covariances. This is important because it allows school priors to play a role in influencing the posterior covariance structure, which will, in turn, influence how much information should be borrowed from other sub-scores. Similar to Model C, Model F is adapted from Model E in the following steps. First, Equations (3.10) and (3.11) are replaced with the following:

$\begin{pmatrix} \tau_{gv} \\ \tau_{gm} \end{pmatrix} \sim \mathrm{MVN}\!\left( \begin{pmatrix} \mu_{gv} \\ \mu_{gm} \end{pmatrix}, \begin{pmatrix} \sigma^2_{\mu v} & \sigma_{\mu vm} \\ \sigma_{\mu mv} & \sigma^2_{\mu m} \end{pmatrix} \right)$,  (3.14)

and

$\begin{pmatrix} \sigma^2_{\tau gv} & \sigma_{\tau gvm} \\ \sigma_{\tau gmv} & \sigma^2_{\tau gm} \end{pmatrix}^{-1} \sim \mathrm{Wishart}(R_{1g}, \nu_{1g})$,  (3.15)

where (µ_gv, µ_gm)′ comes from the school-specific mean vectors (τ_gv, τ_gm)′ for the previous year, and R_1g and ν_1g are based on the school-specific variances/covariances and school sample sizes for the previous year. Further, an informative prior is placed on the between-group variance/covariance matrix:

$\begin{pmatrix} \sigma^2_{\mu v} & \sigma_{\mu vm} \\ \sigma_{\mu mv} & \sigma^2_{\mu m} \end{pmatrix}^{-1} \sim \mathrm{Wishart}(R_2, \nu_2)$,  (3.16)

where R₂ is based on the between-school variances/covariances from the previous year, and ν₂ is based on the number of schools from the previous year.

3.5 Model Prior Specification

In a Bayesian analysis, choosing priors is critical, as the choice of priors can affect convergence and/or parameter estimation. This section is centered on three main issues regarding prior specification for the models used in this study, so as to provide guidance for future replications.

The first issue is concerned with the likelihood-unidentified parameter problem pointed out in Section 2.4. Specifically, the data do not have sufficient information to separate true score variance from error variance. This should not come as a surprise because, in the field of measurement, when only one test administration is involved, error variance is usually estimated as the variance resulting from replications of items, as in the case of Cronbach's alpha; thus, item-level data are required to estimate error variance. Also recall that in Wainer et al.'s (2001) empirical Bayes method, this problem is solved by providing Cronbach's alpha as a reliability measure, which serves to separate true score variance from error variance. In a fully Bayesian framework, fortunately, one can choose to solve the identifiability problem either by adding constraints or by using proper priors. The former is equivalent to providing a reliability measure so as to fix either the true score variance or the error variance to a reasonable value; the latter is to specify proper priors so as to produce acceptable convergence of the posterior. It was the latter approach that was espoused in the present study. It was accomplished by placing a tight prior on the error variance. Specifically, in all six models, an informative prior Gamma(α, β) is placed on the inverse of the error variance, (σ²_e)⁻¹, where α and β are chosen so that the mean (α/β) is equal to the inverse of the product of the observed-score variance and 1 − ρ_xx′ for the strand of interest, and the variance (α/β²) is small enough to produce acceptable convergence.
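For instance (a hypothetical illustration with invented numbers, not values taken from the study's data): on the standardized metric the observed-score variance is approximately 1, so for a strand with reliability ρ_xx′ = 0.70 the error variance is 1 × (1 − 0.70) = 0.30, and α and β would be chosen so that

$\frac{\alpha}{\beta} = \frac{1}{0.30} \approx 3.33,$

with the pair scaled up (say, α = 333.3 and β = 100, giving a prior variance of α/β² ≈ 0.03) so that the prior stays tightly concentrated around this value.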

The second issue has to do with specifying non-informative priors. In the models where no previous test information is available, such as Models A, B, D, and E, non-informative priors are used. Generally speaking, the non-informative priors in these models can be classified into four types: Gamma(α, β) for the inverse of a variance, N(µ, σ²) for a mean, an MVN distribution for a mean vector, and Wishart(R, ν) for the inverse of a variance/covariance matrix. Conventionally, for the first three priors, the parameters are set to the following constants to reflect lack of prior knowledge: Gamma(0.0001, 0.0001); N(0, 10⁻⁶); and an MVN with a zero mean vector and a diagonal precision matrix with entries of 10⁻⁶. Note that the WinBUGS parameterization is adopted here, in which normal distributions are specified in terms of precisions. For Wishart(R, ν), to make it a non-informative prior, ν is usually set to p + 1 + 1, and R is chosen so that R/ν reflects one's best estimate for the covariance matrix before seeing the data.

The third issue is related to the procedures used to specify priors based on the previous-year data. Models C and F use school-specific means and variances/covariances from the previous year to construct priors for the current year. Specifically, for Equation (3.6) in Model C and Equation (3.14) in Model F, µ_τg and (µ_gv, µ_gm)′ come from the school-specific means for the previous year.

For Equation (3.7) in Model C, α_g and β_g are based on the school-specific variances for the previous year. For Equation (3.15) in Model F, ν_1g is set to 1 + N_g, where N_g is the sample size for school g according to the previous-year data, and R_1g/ν_1g is set equal to the variance/covariance estimates for school g from the previous year.

3.6 Graphical Representation of Models

This section provides graphical representations of the six study models to facilitate understanding of how the basic models evolve into more sophisticated models that incorporate collateral information. For ease of illustration, only two sub-scores are used for the multivariate models. Figures 3.1 through 3.3 present the univariate models, while the multivariate models are depicted in Figures 3.4 through 3.6. These graphs are called directed acyclic graphs (DAGs; Thulasiraman & Swamy, 1992), which can be used to visually present model structure. To better understand the model structure, a brief explanation of the three essential elements of DAGs is provided as follows.

DAGs consist of three elements: nodes, edges, and plates. There are three types of nodes: stochastic, logical, and constant. A stochastic node is associated with a density (e.g., in Model D, x[i, 2] ~ N(tau[i, 2], precx[2])); a logical node is associated with a link, which is a logical function (e.g., in Model D, sigma.e[2] <- 1/precx[2]); a constant node is a node that needs to be specified in the data file as fixed values (e.g., in Model D, fixed values need to be provided for R[, ]). Graphically, an oval is used to represent a stochastic node or a logical node, whereas a rectangle is used to represent a constant node.

As such, constant nodes can clearly be separated from the other two types of nodes. Edges appear graphically as arrows pointing from parent nodes into child nodes, where parent nodes refer to the nodes for the parameters of the distribution. Conventionally, a single arrow is used to connect stochastic nodes, and a double arrow is used to connect logical nodes. For example, in Model D, a single arrow is used between x[i, 2] and tau[i, 2], while a double arrow is used between precx[2] and sigma.e[2]. Consequently, a logical node can be distinguished from a stochastic node by the type of arrow used. Plates denote repetition, which is particularly useful in hierarchical models. As these figures show, Models A and D have one plate representing the repetition over persons, whereas the remaining four models have an extra plate representing the repetition over schools.

The first and foremost contrast is between the univariate and multivariate models. Incorporation of collateral information from other sub-scores is achieved via the multivariate framework by modeling the correlations among sub-scores, which is evident from the presence of the parameter prectau[, ]. Another sharp contrast is clearly noted when the non-hierarchical models (Models A and D) and the hierarchical models (Models B, C, E, and F) are compared. Besides the obvious difference in the repetition plates mentioned above, they also differ in distributional assumptions. Let me illustrate this by comparing the two multivariate models, Models D and E. Model D assumes the true sub-scores for person i (i.e., tau[i, ]) to be distributed as a normal distribution with mean vector mu[] and variance/covariance matrix (prectau[, ])⁻¹, where mu[] and (prectau[, ])⁻¹ denote an overall mean vector and an overall variance/covariance matrix for the entire sample, respectively. In contrast, Model E assumes the true sub-scores for person i (i.e., tau.stud[i[j], ]) to be distributed as a normal distribution with mean vector tau.schl[j, ] and variance/covariance matrix (prectau[j, , ])⁻¹, where tau.schl[j, ] and (prectau[j, , ])⁻¹ denote a school-specific mean vector and a school-specific variance/covariance matrix, respectively. Gelman and Hill (2007, p. 393) point out that the group-level model can be interpreted as prior information for the parameters in the individual-level model. Consequently, incorporation of collateral information from schools is achieved via the mechanism whereby tau.schl[j, ] serves as an informative prior for estimating tau.stud[i[j], ].
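The contrast can be sketched in WinBUGS-style code as follows (an illustrative paraphrase of the structure just described rather than a verbatim excerpt from Appendix A; school[i] is assumed to be a data vector giving the school of person i):

   # Model D: one mean vector and one precision matrix for the whole sample
   for (i in 1:N) {
      tau[i, 1:2] ~ dmnorm(mu[], prectau[, ])
   }

   # Model E: school-specific mean vectors and precision matrices
   for (i in 1:N) {
      tau.stud[i, 1:2] ~ dmnorm(tau.schl[school[i], ], prectau[school[i], , ])
   }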

It is also worth noting that by adding schools as a hierarchy, Model E also addresses Simpson's paradox with respect to schools by allowing the true scores for each person to follow a normal distribution with a mean vector and a variance/covariance matrix specific to his or her own school.

A less obvious contrast lies between the models incorporating differing levels of collateral information regarding schools (i.e., Model B vs. Model C, and Model E vs. Model F). Again, let me illustrate this by comparing the two multivariate models, Models E and F. Model E assumes the true sub-scores for school j (i.e., tau.schl[j, ]) to be distributed as a normal distribution with an undifferentiated mean mu[], whereas Model F assumes them to be distributed as a normal distribution with mean mu[j, ], which is differentiated by school. Such differentiation in the latter model is made possible by incorporating collateral information from the previous-year data; school sub-score means from the previous year are available to be fed into the parameter mu[j, ] for the current year. A similar illustration can be made with the school-level variance/covariance matrix, (prectau[j, , ])⁻¹. Model E assumes prectau[j, , ] to be distributed as a Wishart distribution with an undifferentiated scale matrix R1[, ], whereas Model F assumes it to be distributed as a Wishart distribution with a school-differentiated R1[j, , ]. School-level variance/covariance matrices from the previous year are used to construct R1[j, , ] for the current year. Evidently, such an accumulative use of multiple-year test information can potentially help accentuate school-specific sub-score profiles if any consistent trend exists.

3.7 Comparison Criteria

Three comparison criteria were included in the study: signal/noise ratio (SNR), standard error of estimate (SEE), and sub-score separation index (SSI). The first two were concerned with the estimation precision of sub-scores, while the last criterion was used to assess sub-score profile variability. They are defined and operationalized as follows.

3.7 Comparison Criteria

Three comparison criteria were included in the study: signal/noise ratio (SNR), standard error of estimate (SEE), and sub-score separation index (SSI). The first two were concerned with estimation precision of sub-scores, while the last criterion was used to assess sub-score profile variability. They are defined and operationalized as follows.

3.7.1 Signal/Noise Ratio

Signal/noise ratio (SNR) refers to the ratio of signal power to noise power (Cronbach and Gleser, 1964). In the present study, the signal refers to the person's true score and the signal power to the true score variance; the noise refers to the error associated with the true score and the noise power to the error variance. In the Bayesian paradigm, the true score is estimated by the mean of the posterior distribution of the sub-score of interest, and the true score variance is operationalized as the variance of the true sub-score estimates across persons, \hat{\sigma}^2_{\hat{\tau}}. Similarly, the error variance is operationalized as the variance of the posterior distribution of the sub-score of interest, \sigma^2_{E(\hat{\tau})}. A posterior distribution is generated for each person; therefore, each person has a conditional error variance. For the sake of summary, an overall index for the error variance was obtained by averaging the conditional error variances over all persons. SNR for a particular sub-score is therefore operationalized as the ratio of the variance of the posterior means to the average posterior variance across all persons for that sub-score:

\mathrm{SNR} = \frac{\hat{\sigma}^2_{\hat{\tau}}}{\bar{\sigma}^2_{E(\hat{\tau})}} = \frac{\hat{\sigma}^2_{\hat{\tau}}}{\frac{1}{n}\sum_{i=1}^{n}\sigma^2_{E(\hat{\tau})_i}}
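Given posterior draws for one sub-score (e.g., exported from WinBUGS via CODA), this definition can be computed directly; the sketch below assumes an iterations-by-persons array and uses simulated placeholder values.

import numpy as np

rng = np.random.default_rng(0)
draws = rng.normal(size=(5000, 300))          # placeholder: iterations x persons

post_means = draws.mean(axis=0)               # tau-hat for each person
signal = post_means.var(ddof=1)               # variance of the posterior means
noise = draws.var(axis=0, ddof=1).mean()      # average conditional posterior variance
print("SNR:", signal / noise)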

3.7.2 Standard Error of Estimate

Standard error of estimate (SEE) is defined as the standard deviation of the posterior distribution. For reasons similar to those described above, an overall index for SEE was obtained by averaging the conditional error variances over all persons and then taking the square root:

\mathrm{SEE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\sigma^2_{E(\hat{\tau})_i}}

3.7.3 Sub-score Separation Index

Sub-score separation index (SSI) is a measure of sub-score performance in terms of detecting sub-score differences within a person (Tate, 2004). In the present study, SSI is operationalized as the percentage of times that the ratio of the estimated individual sub-score difference to the sum of the standard errors for the two sub-scores is greater than one. This definition is consistent with the common practice used to assess sub-score differences, which is to place confidence intervals around sub-score estimates and assess whether they overlap. In a Bayesian context, the standard error of a sub-score is obtained by taking the standard deviation of the posterior distribution (i.e., SEE). SSI is an important index for the study because augmentation has two opposite effects: one is to enhance sub-score precision, and the other is to wipe out sub-score differences. SSI can assess the extent to which precision is enhanced without eliminating the differences.
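SEE and SSI can be computed from the same kind of posterior output; in the sketch below, the two draw arrays and their dependence structure are simulated placeholders.

import numpy as np

rng = np.random.default_rng(1)
n_iter, n_person = 5000, 300
draws_a = rng.normal(size=(n_iter, n_person))                              # sub-score 1
draws_b = 0.6 * draws_a + rng.normal(scale=0.8, size=(n_iter, n_person))   # sub-score 2

# SEE: square root of the average conditional posterior variance
see_a = np.sqrt(draws_a.var(axis=0, ddof=1).mean())

# SSI: proportion of persons whose estimated sub-score difference exceeds
# the sum of the two posterior standard deviations (non-overlapping bands)
diff = np.abs(draws_a.mean(axis=0) - draws_b.mean(axis=0))
se_sum = draws_a.std(axis=0, ddof=1) + draws_b.std(axis=0, ddof=1)
print("SEE:", see_a, "SSI:", np.mean(diff / se_sum > 1.0))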

3.8 Analysis Procedures

Sub-score estimates were obtained by standardizing the number-correct scores for each strand, which were used as input for Models A through F. For the 2006 estimation, non-informative priors were used for Models C and F, the output of which was used to compute priors for the 2007 estimation of the corresponding models. Four outputs from the WinBUGS program were used to evaluate convergence: descriptive statistics of the posterior distribution, history plots, autocorrelation plots, and Brooks, Gelman, and Rubin (BGR) diagnostic plots. After convergence was deemed acceptable, the comparison criteria were calculated for the six models.

3.9 Issues in Implementing MCMC Methods

The rationale for MCMC methods is that the sequence of states for the Markov chain should theoretically converge to a stationary distribution, such that the sampled observations can be viewed as a sample from the posterior distribution of the model parameters (Kim and Bolt, 2007). It is critical, therefore, to evaluate chain convergence to ensure that the samples on which inferences are based are truly representative of the underlying stationary distribution of the Markov chain. A variety of diagnostic tools have been developed to assess convergence. Interested readers are referred to Cowles and Carlin (1996) for a review of convergence diagnostics. In the following, a brief introduction is given only to several diagnostic criteria that have been widely applied in practice and can be implemented without the aid of additional computer programs.

Lack of convergence can sometimes be apparent from an inspection of the history of the chain. For example, the top panel of Figure 3.7 shows a high likelihood of convergence, whereas the bottom panel demonstrates non-convergence. This often serves as a quick check of convergence and is directly available by clicking the history button of the WinBUGS sample monitor tool.

Another diagnostic criterion is the autocorrelation between parameter values sampled at successive states in the chain. Figure 3.8 shows examples of autocorrelation plots. The vertical axis is the autocorrelation, and the horizontal axis is the lag. Autocorrelation plots show the autocorrelation function of the variable out to 50 lags. Low autocorrelation is desired, whereas high autocorrelation indicates either a slow-mixing sampler or non-convergence. In the presence of high autocorrelation, a very large number of iterations is required before the sampled states can be viewed as a sample from the posterior (Kim and Bolt, 2007). In the left panel of Figure 3.8, the autocorrelation drops close to zero soon after the beginning of sampling, suggesting low autocorrelation. By contrast, the right panel shows an example of high autocorrelation; the

autocorrelation does not begin to tail off until close to 40 lags. This implies that the Markov chain is not randomly walking across the whole posterior distribution but is stuck somewhere in the posterior distribution and moving slowly. Such plots are also readily available by clicking the auto cor button of the WinBUGS sample monitor tool.

Monte Carlo error is often used as an indicator of whether the chain is long enough to be considered convergent. It occurs because the posterior distributions are constructed from samples; it is a form of sampling error. It can influence the posterior inference along with the standard error of the point estimate (as reflected by the standard deviation of the posterior). As a rule of thumb, the simulation should be run until the Monte Carlo error for each parameter of interest is less than about 5% of the sample standard deviation (Spiegelhalter et al., 2003). It is important to note that the Monte Carlo error can always be reduced by lengthening the chain. Such information can be obtained by clicking the stats button of the WinBUGS sample monitor tool.

Another convergence diagnostic is Gelman and Rubin's (1992) method. Different from the previous three criteria, this statistic is based on multiple chains simulated with initial values that are overdispersed with respect to the target distribution. It is estimated as

\hat{R} = \left( \frac{n-1}{n} + \frac{m+1}{mn}\,\frac{B}{W} \right) \frac{df}{df-2},

where B is the variance between the means from the m parallel chains, W is the average of the m within-chain variances, and df is the degrees of freedom of the approximating t density. When \hat{R} approaches 1, the pooled within-chain variance dominates the between-chain variance, which can be interpreted as evidence that all chains have escaped the influence of their starting points and have traversed all of the target distribution (Cowles and Carlin, 1996).
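The sketch below computes the uncorrected part of this statistic (omitting the df/(df-2) factor, which requires estimating df) together with the 5%-of-standard-deviation Monte Carlo error rule of thumb; the chains are simulated placeholders.

import numpy as np

rng = np.random.default_rng(2)
chains = rng.normal(size=(4, 2000))        # placeholder: m chains x n iterations

m, n = chains.shape
chain_means = chains.mean(axis=1)
B = n * chain_means.var(ddof=1)            # between-chain variance (n times the variance of the chain means)
W = chains.var(axis=1, ddof=1).mean()      # average within-chain variance
r_hat = (n - 1) / n + (m + 1) / (m * n) * B / W
print("R-hat (uncorrected):", r_hat)

# Naive Monte Carlo error check (ignores autocorrelation; effective sample
# size or batch means would be used in practice)
pooled = chains.ravel()
mc_error = pooled.std(ddof=1) / np.sqrt(pooled.size)
print("MC error under 5% of SD:", mc_error < 0.05 * pooled.std(ddof=1))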

This statistic is provided in the BGR diagram, which can be obtained by clicking the bgr diag button of the WinBUGS sample monitor tool. Figure 3.9 shows examples of BGR diagrams. The three lines represent the pooled variance, the within-chain variance, and their ratio, respectively; \hat{R} is depicted by the red line. In the left panel, \hat{R} is close to 1, suggesting a high likelihood of convergence; in the right panel, \hat{R} ranges from around 5 to 10, which is strong evidence against convergence.

The four diagnostic criteria described above are among the most popular tools for evaluating convergence in the Bayesian community, at least in part because they are directly available from WinBUGS. However, it should be noted that even satisfaction of all of the above criteria should not be taken as a guarantee of convergence.

Besides assessing convergence, there are other issues that are important in implementing MCMC methods. One of them is deciding how many chains to run. There is no agreement between using one very long chain and using several long chains. The several-long-runs school argues that comparing several seemingly converged chains might reveal genuine differences if the chains have not yet approached stationarity. The one-very-long-run school argues that one very long run has the best chance of finding new modes, and that comparison between chains can never prove convergence (Gilks et al., 1995). Specifically, they point out that if one compares, say, a single chain run for 10,000 iterations with 10 independent chains each run for 1,000 iterations, then the last 9,000 iterations from the single long chain are all drawn from distributions that are likely to be closer to the true target distribution than those reached by any of the shorter chains (Cowles and Carlin, 1996). Furthermore, running multiple chains is considered very inefficient when early iterations need to be discarded as burn-in samples for each chain. Therefore, the present study chose to run a single long chain to obtain posterior distributions while still using multiple chains to compute the convergence diagnostic statistic \hat{R}.
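A minimal sketch of this single-long-chain workflow follows; the burn-in length and thinning interval are illustrative assumptions rather than values used in the study.

import numpy as np

rng = np.random.default_rng(3)
chain = rng.normal(size=50_000)   # placeholder draws from one long run

kept = chain[5_000:]              # discard an assumed burn-in of 5,000 draws

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag."""
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

# Inspect autocorrelation at a few lags, then thin where it is near zero
print({lag: round(autocorr(kept, lag), 3) for lag in (1, 5, 10, 20, 40)})
thinned = kept[::10]              # assumption: lag-10 autocorrelation is negligible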

3.10 Summary

The purpose of the study was to determine the extent to which the use of three sources of collateral information could improve sub-score estimation precision for real state accountability assessments. The three sources of collateral information under investigation included: (1) information from other sub-scores, (2) schools that students attended, and (3) school-level scores on the same test taken by previous cohorts of students in that school. To examine the incremental effects of these three sources of collateral information, six models were constructed: (1) Model A: the fully Bayesian Kelley's regressed score model, a non-augmented model that estimates sub-scores based on items only within the strand of interest; (2) Model B: Model A plus schools attended; (3) Model C: Model A plus both schools attended and previous-year school-level sub-score information; (4) Model D: Shin's multivariate fully Bayesian model that incorporates information from other strands; (5) Model E: Model D plus schools attended; and (6) Model F: Model D plus both schools attended and previous-year school-level sub-score information.

Models B, C, E and F are Bayesian hierarchical models with schools as a hierarchy. Such hierarchical models can provide posterior distributions for school-level statistics, which allows the same types of analyses to be conducted on schools as on students. This was considered very important for two main reasons: (1) under NCLB, school performance is in and of itself interesting to various stakeholders (e.g., to evaluate school curricular effectiveness, to pinpoint the strengths and weaknesses in school-wide instruction), and (2) sub-scores are more reliable at the school level than at the individual level, and thus may offer diagnostic value even when individual sub-scores fail to (Haberman, Sinharay, and Puhan, 2006; Sinharay, Haberman, and Puhan, 2007).

Three comparison criteria were used to evaluate the results: signal/noise ratio, standard error of estimate, and sub-score separation index. The first two are global measures of sub-score precision; the last assesses the

extent to which augmentation procedures could enhance sub-score precision without eliminating sub-score differences. Both English language arts and mathematics were investigated because they exhibit different design characteristics in terms of sub-score correlations and reliabilities. It is therefore of interest to determine whether the effects of using information from other sub-scores differ for these two tests. Three school sizes were used to sample schools: small, medium, and large. School size was considered in the study mainly because the weight given to prior test information was directly related to school size; it was therefore considered important to investigate the extent to which the effects of using previous-year data might differ for schools of different sizes. Specifically, the following research questions were addressed in the study:
1. What effects does the use of the three sources of collateral information have on student-level sub-score estimation in terms of signal/noise ratio, standard error of estimate, and sub-score separation index? Do the findings differ for different tests (i.e., ELA and math)?
2. For the four hierarchical models (i.e., Models B, C, E and F), what effects does the use of collateral information have on school-level sub-score estimation in terms of signal/noise ratio, standard error of estimate, and sub-score separation index? Do the findings differ for different tests (i.e., ELA and math)?
3. For the four hierarchical models (i.e., Models B, C, E and F), how do sub-score estimates compare at the student and school levels in terms of the three criteria? Do the findings differ for different tests (i.e., ELA and math)?
4. Do the effects of using collateral information differ for schools of different sizes?

Table 3.1 Test design characteristics for ELA
Test components / Number of points / Number of MC items / Number of CR items / Reliability coefficient
Total Test (Writing excluded)
Strands: Reading; Language Convention; UIR
Note: UIR refers to use of information resources.

Table 3.2 Test design characteristics for mathematics
Test components / Number of points / Number of MC items / Number of CR items / Reliability coefficient
Total Test
Strands: NNR; Algebra; Measurement; Geometry; DPD; PRF
Note: NNR, DPD, and PRF refer to numbers and number relations; data analysis, probability, and discrete math; and patterns, relations, and functions, respectively.

Table 3.3 Correlations and disattenuated correlations between strands for ELA
2007 sample: Reading; Language Convention; UIR
2006 sample: Reading; Language Convention; UIR
Note: The elements above the diagonal are disattenuated correlations. UIR refers to use of information resources.

Table 3.4 Correlations and disattenuated correlations between subtests for mathematics
2007 sample: NNR; Algebra; Measurement; Geometry; DPD; PRF
2006 sample: NNR; Algebra; Measurement; Geometry; DPD; PRF
Note: The elements above the diagonal are disattenuated correlations. NNR, DPD, and PRF refer to numbers and number relations; data analysis, probability, and discrete math; and patterns, relations, and functions, respectively.

Table 3.5 Descriptive information for stratum variables and demographic variables for the 2007 sample
Population (N, Percentage) / Sample (N, Percentage)
Gender: Female % %; Male % %
Ethnicity: Black % %; White % %; Others 106 5% %
Free lunch status: Paid 781 3% %; Reduced % %; Free 089 9% %
Limited English proficiency: No % %; Yes % 150 .6%
School size: Large (100+) 58 16% 15 17%; Medium (50 to 99) % 36 4%; Small (10 to 49) 16 43% 35 41%

Table 3.6 Descriptive information for stratum variables and demographic variables for the 2006 sample
Population (N, Percentage) / Sample (N, Percentage)
Gender: Female % %; Male % %
Ethnicity: Black % %; White % %; Others % %
Free lunch status: Paid % %; Reduced % %; Free % %
Limited English proficiency: No % %; Yes % %
School size: Large (100+) % 15 17%; Medium (50 to 99) % 36 4%; Small (10 to 49) % 35 41%

Table 3.7 Descriptive information for ELA and mathematics raw scores for the 2007 sample
Population (Mean, Std Dev) / Sample (Mean, Std Dev)
ELA: Total; Reading; Language Convention; UIR
Mathematics: Total; NNR; Algebra; Measurement; Geometry; DPD; PRF
Note: UIR refers to use of information resources. NNR, DPD, and PRF refer to numbers and number relations; data analysis, probability, and discrete math; and patterns, relations, and functions, respectively.

Table 3.8 Descriptive information for ELA and mathematics raw scores for the 2006 sample
Population (Mean, Std Dev) / Sample (Mean, Std Dev)
ELA: Total; Reading; Language Convention; UIR
Mathematics: Total; NNR; Algebra; Measurement; Geometry; DPD; PRF
Note: UIR refers to use of information resources. NNR, DPD, and PRF refer to numbers and number relations; data analysis, probability, and discrete math; and patterns, relations, and functions, respectively.

Figure 3.1 Graphic representation of Model A: Kelley's model

Figure 3.2 Graphic representation of Model B: Kelley's model incorporating school information

Figure 3.3 Graphic representation of Model C: Kelley's model incorporating school information and previous-year school-level sub-score information

Figure 3.4 Graphic representation of Model D: Shin's model

Figure 3.5 Graphic representation of Model E: Shin's model incorporating school information

Figure 3.6 Graphic representation of Model F: Shin's model incorporating school information and previous-year school-level sub-score information

Figure 3.7 Examples of MCMC sampling history plots (mu and prectau[1,] plotted against iteration)

Figure 3.8 Examples of MCMC autocorrelation plots (mu and mu[4] plotted against lag)
Figure 3.9 Examples of BGR diagnostic plots (sigmatau, plotted against start iteration)

CHAPTER IV. RESULTS

Real data from state accountability assessments were used to assess the effects of using different sources of collateral information on the estimation precision of sub-scores. Two subject areas were chosen for analysis: English Language Arts (ELA) and mathematics. The standardized score metric was used for analysis. The six proposed models were constructed as specified previously in Chapters II and III. The only exception was Model F for math. When previous-year school-specific variance/covariance matrices were fed into the model as priors, some matrices were not positive definite due to near-perfect correlations, and as a result, the model could not run. Wainer et al. (2001, p. 35) noted a similar problem with their method, which required S_true to be positive definite. Therefore, Model F was modified to incorporate only school-specific means and variances as priors; previous-year covariances were not used in Model F for math. This was not the case with ELA because the correlations among ELA sub-scores were lower.
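The following sketch illustrates the kind of positive-definiteness screen this problem implies; the matrices are hypothetical, and the diagonal fallback mirrors the modification described above (keeping only school-specific means and variances).

import numpy as np

def is_positive_definite(cov):
    """True if a Cholesky factorization of cov succeeds."""
    try:
        np.linalg.cholesky(cov)
        return True
    except np.linalg.LinAlgError:
        return False

# Hypothetical school covariance with a near-perfect correlation
bad = np.array([[1.000, 0.999],
                [0.999, 0.998]])
print(is_positive_definite(bad))    # False: the matrix is not positive definite

# Fallback mirroring the modified Model F: keep only the variances (diagonal)
good = np.diag(np.diag(bad))
print(is_positive_definite(good))   # True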

Model convergence was evaluated in terms of the four diagnostic criteria proposed in Section 3.9: history plots, autocorrelation plots, Monte Carlo error, and Gelman and Rubin's R-hat statistic. These statistics suggested a high likelihood of Markov chain convergence for all the models investigated. Appendix B provides all the diagnostic plots for Models A, B, D, and E. The diagnostic plots for Models C and F were very similar to those for Models B and E and are therefore not provided, to avoid redundancy. After model convergence was deemed acceptable, the comparison criteria were calculated for all six models. This was done for both subjects (ELA and mathematics). In the following, results are presented first for ELA and then for math.

4.1 Model Comparison for ELA

Comparison criteria for student-level statistics were calculated for all six models, while those for school-level statistics were calculated only for the four hierarchical models (Models B, C, E and F). Tables 4.1 to 4.3 present the results for ELA in terms of the three comparison criteria (i.e., SNR, SEE, and SSI), respectively. To facilitate comparisons, the same information is depicted visually in Figures 4.1 to 4.3 for student-level results and in Figures 4.4 to 4.6 for school-level results.

Figures 4.1 to 4.3 reveal a reasonably consistent trend regarding the effects of using other sub-scores as collateral information on student-level statistics: incorporation of such collateral information resulted in enhanced estimation precision but reduced profile variability. Specifically, when Shin's (2004) model and its derived models were used, there was a considerable increase in SNR, a sizeable decrease in SEE, and a dramatic drop in SSI. This pattern was most apparent with the third sub-score, which was the least reliable of the three. The effects of using the other two sources of collateral information were less clear. Tables 4.5 to 4.8 reveal a weak but relatively consistent pattern regarding the effects of using school attended as collateral information. Generally speaking, adding schools as a hierarchy in both Kelley's model and Shin's model led to somewhat enhanced SNR (from 5.409, and to 5.719, and in Kelley's model, and from 6.818, and 2.970 to 7.371, and in Shin's model) and reduced SEE (from 0.363, and 0.49 to 0.355, and in Kelley's model, and from 0.38, 0.31 and to 0.313, 0.30 and in Shin's model), but the effects were not as dramatic as those observed with using other sub-scores as collateral information. No consistent pattern was observed for SSI. With regard to the use of previous-year school-level sub-score information, the results were mixed, and no consistent pattern could be extracted for the univariate model. This was probably because previous-year sub-score information was implemented at the school level rather than at the student level; consequently, such collateral information had only an indirect impact on student-level results through school-level results. For the multivariate model, a seemingly aberrant pattern was observed: using previous-year school-level sub-score information led to reduced estimation precision (e.g., SNR decreased from 7.371, 8.316

and to 7.310, 8.83 and 3.713) and increased profile variability (i.e., SSI increased from 0.137, and to 0.159, and 0.100). How did this happen? A comparison of sub-score correlations across the two years may provide an answer. Table 3.3 shows that correlations were lower for the 2006 sample than for the 2007 sample. Consequently, when school-specific sub-score correlation information from the 2006 sample was incorporated, lower correlations were used in the sub-score augmentation algorithm, and thus less information was borrowed from other sub-scores. The higher SSI was also a direct result of the lowered correlations used in the algorithm.

For school-level results, two types of effects merit attention: (1) the effect of using information from other sub-scores (Models B vs E), and (2) the effect of incorporating previous-year school-level sub-score information into the univariate and multivariate models, respectively (Models C vs B, and Models F vs E). Consistent with student-level results, the use of information from other sub-scores resulted in enhanced SNR and reduced SEE and SSI. Specifically, when Models B and E were compared, SNR increased from 1.179, and 5.35 to , and 7.963; SEE decreased from 0.15, and 0.18 to 0.119, 0.11 and 0.11; SSI decreased from 0.44, 0.30 and 0.91 to 0.09, 0.79 and Different from student-level results, however, the use of previous-year school-level sub-score information yielded a strong and consistent pattern for both univariate and multivariate models: it led to enhanced SNR and SSI as well as reduced SEE. For the univariate model, SNR increased from 1.179, and 5.35 to , 15.0 and 7.787; SEE decreased from 0.15, and 0.18 to 0.110, and 0.111; SSI increased from 0.44, 0.30 and 0.91 to 0.91, and For the multivariate model, SNR increased from , and to , and 9.567; SEE decreased from 0.119, 0.11 and 0.11 to 0.114, and 0.106; SSI increased from 0.09, 0.79 and 0.33 to 0.56, and This finding was anticipated in the sense that such collateral information was implemented at the school level as priors and thus was expected to have a direct and possibly strong impact on

school-level results, although not necessarily on student-level results. However, it also appears perplexing in the sense that it not only generated scores with greater estimation precision (higher SNR and lower SEE) but also increased their profile variability (higher SSI). Recall that using information from other sub-scores resulted in lower SSI, while the impact of using information on schools attended was negligible and inconsistent for SSI. This finding was nonetheless deemed reasonable because it was consistent with the hypothesis that, if there is a pattern of strengths and weaknesses in a school's curriculum, such a pattern will be accentuated and manifested through the accumulative use of test information across years. Recall that Models C and F accomplish such an accumulation by incorporating school sub-score profiles from the previous year as priors when estimating sub-scores for the current year.

When student-level and school-level sub-scores are compared, school-level sub-scores tended to have much higher estimation precision, as reflected in larger SNR and smaller SEE. This finding flows directly from the central limit theorem. However, there was no such consistent pattern regarding SSI. To understand this finding, it is important to recognize that student-level SSI and school-level SSI are two different concepts: the former refers to students' personal sub-score profiles, whereas the latter refers to a school's curricular profile. When students within the same school show a similar pattern of strengths and weaknesses, that pattern will manifest itself at the school level as a curricular profile. In the absence of such a pattern among students in the same school, there could be a small SSI at the school level despite a large SSI at the student level. Therefore, no theory dictates the relative magnitude of student-level SSI and school-level SSI, because they refer to different concepts.

To examine the incremental effects of using differing levels of collateral information, ratios of the comparison criteria across models were obtained for both student-level and school-level results. Nine comparisons were identified for student-level results, and two comparisons for school-level results, each representing the use of one, two, or

three sources of collateral information in either Kelley's model or Shin's model. For example, Model D is an extension of Model A that allows sub-scores to be correlated with one another; therefore, by comparing Models A and D, we can examine the effects of using information from other sub-scores. An example of examining the effects of more than one source of collateral information is to compare Models D and F, where Model F incorporates collateral information from both schools attended and previous-year school-level sub-score information; comparing these two models can therefore reveal the effects of incorporating these two sources of collateral information. It should also be noted that using the same source of collateral information in different models has different implications. For example, adding a school hierarchy to Kelley's model or to Shin's (2004) model has different effects on the augmentation algorithm: the multilevel Kelley's model differs from Kelley's model only by using school means and variances, whereas the multilevel Shin's model also involves using school covariances. Another example is that incorporating previous-year school-level sub-score information in the univariate model means incorporating previous-year school means and variances, whereas in the multivariate model it means incorporating previous-year school means and variances/covariances. It is important to bear these differences in mind when assessing the effects of collateral information in comparing models.

Tables 4.4 to 4.6 show ratios of criteria for the nine comparisons for student-level results and the two comparisons for school-level results. Ratios were obtained by dividing the comparison criteria for the more sophisticated models by those for the more basic models. Figures 4.7 to 4.9 focus on comparing the five subsequent models with the baseline model for student-level results.

For student-level results, ratios for most comparisons concerning the use of collateral information from other sub-scores and schools attended turned out to be greater than one for SNR and smaller than one for SEE. This confirms the finding noted above that the use of these two sources of collateral information resulted in enhanced estimation

precision. Regarding the use of information from other sub-scores, SNR ratios ranged from 1.10 to 2.09 and SEE ratios from to Using schools attended led to less improvement in estimation precision, with SNR ratios ranging from 1.04 to 1.39, and SEE ratios from to Additionally, the results show a drastic drop in SSI as a result of using other sub-scores as collateral information, reflected in the corresponding SSI ratios ranging from 0.281 to 0.505. This suggests that only 28.1% to 50.5% of sub-score pairs that exhibited differences based on Kelley's model were identified as having non-overlapping error bands based on Shin's model. No strong or consistent pattern was observed for the use of previous-year school-level sub-score information based on Kelley's model. However, when such information was used in Shin's (2004) model, it led to reduced estimation precision (SNR ratios from to 0.996; SEE ratios from to 1.053) and increased profile variability (SSI ratios from to 1.441).

Tables 4.4 to 4.6 also show the magnitudes of the effects of using different individual sources or different combinations of collateral information, which could be helpful in determining which collateral information to use. For each individual source of collateral information, the general rank ordering of the effects was: information from other sub-scores > schools attended > previous-year school-level sub-score information, which held for both the univariate and multivariate models. For combinations of collateral information, incorporating information from other sub-scores plus schools attended seemed to produce the largest effects. It is important to note that the word "effects" is used here, which should be interpreted differently from "benefits." As will be discussed more fully in Chapter V, it should be score users' responsibility to judge which effects are benefits in their contexts.

For school-level results, ratios for both comparisons were greater than one for SNR and SSI, and smaller than one for SEE, suggesting enhanced estimation precision along with increased profile variability. Another noteworthy finding was that incorporating

previous-year school-level sub-score information seemed to produce smaller effects in the multivariate model than in the univariate model. In the univariate model, SNR ratios ranged from to 1.499, SEE ratios from to 0.880, and SSI ratios from to In the multivariate model, SNR ratios ranged from 1.17 to 1., SEE ratios from to 0.960, and SSI ratios from 1.15 to Again, this could be explained by the lowered correlations used in the augmentation algorithm as a result of incorporating previous-year covariances in the multivariate model. Figures 4.7 to 4.9 show comparisons of the five subsequent models with the baseline model for student-level results. The results reveal a clear separation between univariate and multivariate models, with the latter having greater estimation precision and smaller profile variability.

Tables 4.7 to 4.9 organize the three comparison criteria to show how the effects of using collateral information differ for schools of different sizes: small, medium, and large. These results were obtained from the school-level statistics for the four hierarchical models. Generally speaking, across all models, larger school sizes correspond to larger SNR and SSI as well as smaller SEE. There appears to be greater disparity between large and medium schools than between medium and small schools. This was a direct result of how school size was defined in this study. As defined in Section 3.2, small and medium schools refer to schools with fewer than 50 students and with 50 to 99 students, respectively, while large schools refer to schools with more than 100 students; therefore, the disparity in the comparison criteria between small/medium schools and large schools can be attributed to the disparity in sample sizes between them. As more collateral information is incorporated, small and medium schools seem to garner larger increases in SNR and larger decreases in SEE than large schools. This finding is consistent with the general finding in the literature that the more unreliable sub-scores are, the more strength they borrow from other collateral information (Wainer et al., 2001). As Table 4.7 shows, sub-scores tend to be less reliable for small and medium

schools; therefore, they were anticipated to benefit more from the use of collateral information. However, no consistent pattern was observed for SSI.

4.2 Model Comparison for Math

The same models were run for math, except for Model F, in which previous-year school-specific covariances were not incorporated as priors due to the mathematical constraints placed on matrix computation. The same set of statistics was computed for math, analyzed in the same fashion, and presented in the same layout as in the previous sections. Tables 4.10 to 4.12 present the results for math in terms of the three comparison criteria (i.e., SNR, SEE and SSI), respectively. Figures 4.10 to 4.12 graphically depict student-level results, and Figures 4.13 to 4.15 show school-level results.

For student-level results, Figures 4.10 to 4.12 reveal an increase in SNR and a decrease in SEE and SSI associated with the use of information from other sub-scores. This finding is similar to what was observed with ELA, except that the magnitudes of change are considerably larger. For example, Table 4.10 shows that SNRs ranged from to 2.615 for the univariate models, while they were between and for the multivariate models. An improvement of this magnitude as a result of using information from other sub-scores can be attributed to the relatively low reliabilities but high correlations among math sub-scores, as can be seen in Tables 3.2 and 3.4. With such a pattern of reliabilities and correlations, it is reasonable to see strong augmentation effects resulting from borrowing much information from other sub-scores. Regarding the effects of adding schools as a hierarchy, the picture is more complicated than that for ELA. Generally speaking, for the univariate model, the use of such collateral information resulted in enhanced estimation precision (i.e., larger SNR and smaller SEE), which is consistent with what was observed with ELA. The largest improvement was observed with the fourth strand, for which SNR increased from to 2.100, and SEE dropped

from to 0.499, while the smallest improvement was observed with the sixth strand, for which SNR increased from to 1.401, and SEE dropped from to However, for the multivariate model, there is an opposite pattern for most comparisons (i.e., smaller SNR and larger SEE). How did this happen? As was explained in Section 4.1, the use of schools as a hierarchy has different implications for the augmentation algorithm depending on whether the model is univariate or multivariate. In a multivariate model, besides using school means and variances as a univariate model does, school-specific covariances are also involved. When the entire sample is disaggregated by schools, the correlations within schools may not be as high as those across the state due to range restriction. This phenomenon is more manifest when schools vary considerably in how a subject area is taught in terms of quality and curricular emphases. Appendix C displays sub-score correlations for ten selected schools for ELA and math, respectively. It can be seen that school-specific correlations are lower than the state-wide correlations for both ELA and math, but much more so for math. This pattern calls into question the validity of using a state-wide correlation matrix to augment scores for all students, as was done in the non-hierarchical models. The reduction in estimation precision resulting from using schools as a hierarchy could possibly be attributed to the lowered correlations used for augmentation when the entire sample was disaggregated by schools. Such effects were not manifested in the ELA results, probably because disaggregation did not cause correlations to decrease as much for ELA as for math. No consistent pattern regarding SSI was observed for the univariate model, but adding schools generally led to enhanced SSI for the multivariate model. Increases between and 0.00 were observed for most comparisons, while the largest increase was observed with the first comparison. With regard to the use of previous-year school-level sub-score information, the effects were generally very small, as can be seen in Figures 4.10 to 4.12, where the lines representing Models C and F are closely intertwined with those representing Models B and E, respectively.

For school-level results, the patterns for math are not as clear-cut as those for ELA, in that more exceptions to the general trends occurred. Generally speaking, the following rank ordering was observed for most comparisons: Model F > Model C > Model E > Model B for SNR, and Model F < Model C < Model E < Model B for SEE. For SSI, the picture contains much noise, but a somewhat clear pattern is that Model E consistently remains below all the other models, and Model C is above all the other models most of the time. Therefore, it is safe to conclude that the use of information from other sub-scores enhanced estimation precision but reduced profile variability, while the use of previous-year school-level sub-score information in both the univariate and multivariate models generally led to enhanced estimation precision as well as increased profile variability. These findings are quite consistent with what was observed with ELA. Models F and C always ranked higher than Models E and B with respect to estimation precision, suggesting that previous-year school-level sub-score information was more effective than information from other sub-scores in improving estimation precision at the school level. This finding is also consistent with what was observed with ELA. Another noteworthy finding concerns the general magnitude of SSI for school-level results. The student-level SSI for the non-augmented model (i.e., Model A) was mostly between 0.05 and However, the school-level SSI for all four hierarchical models was mostly between 0.20 and 0.40, which represents a remarkable increase in profile variability. As was explained in Section 4.1, student-level SSI refers to a student's personal sub-score profile, whereas school-level SSI refers to a school's curricular profile. It can be the case that a profile of strengths and weaknesses is present but not strong enough to manifest itself at the student level due to lack of estimation precision. However, if a sufficient number of students within a school show a similar profile, that profile will manifest itself as a curricular profile for the school. The high level of estimation precision associated with school-level results allows for detection of profiles that are hard to detect at the student level. This finding empirically confirms the expectation that

a number of researchers have placed on school-level sub-scores after being dismayed by the lack of diagnostic value offered by student-level sub-scores (Haberman, Sinharay, and Puhan, 2006; Sinharay, Haberman, and Puhan, 2007). This was not the case with ELA, however. The reason needs further investigation, especially from a curricular perspective.

Tables 4.13 to 4.15 show ratios of criteria for the nine comparisons for student-level results and the two comparisons for school-level results. Figures 4.16 to 4.18 focus on comparisons of the five subsequent models with the baseline model for student-level results. For student-level results, the patterns observed are very similar to those observed with ELA, but the magnitude of the effects is quite different. For the use of information from other sub-scores, SNR ratios ranged from to 5.985, SEE ratios from 0.54 to 0.659, and SSI ratios from to. This pattern implies that using information from other sub-scores greatly enhances estimation precision but almost eliminates profile variability. Such a finding echoes the concerns expressed in the literature: when sub-scores are highly correlated, augmented sub-scores become highly reliable, but they turn out to be virtually the same for any examinee and thus have little diagnostic value (Wainer et al., 2000; Wainer et al., 2001). This was the case with the math test under investigation. Table 3.4 shows that the disattenuated correlations among math sub-scores were mostly in the 0.90s, which allowed sub-scores to borrow information from one another to a considerable extent. Consequently, SNRs after augmentation increased three to six times, but sub-score differences also dropped to almost zero. For adding schools as a hierarchy, results were not only divergent for the univariate and multivariate models, but also somewhat mixed within each model. For the univariate model, estimation precision was generally enhanced, but the effects regarding SSI were largely mixed. Specifically, SNR ratios were consistently greater than one, ranging from to 1.401; SEE ratios were consistently smaller than one, ranging from 0.91 to 0.963; and SSI ratios were completely mixed. For the multivariate model, the general pattern was reduced estimation

precision and slightly increased profile variability. Specifically, four out of six sub-scores showed SNR ratios larger than one (1.065 to 1.483) and SEE ratios smaller than one (0.84 to 0.964). SSI ratios were mostly infinite because the denominators were mostly zero and the numerators were all positive. Regarding the use of previous-year school-level sub-score information, the effects were generally very small, with most ratios for the precision criteria hovering between 0.98 and 1.02. Consistent with what was observed with ELA, for each individual source of collateral information, the general rank ordering of the effects was: information from other sub-scores > schools attended > previous-year school-level sub-score information. For combinations of collateral information, however, the model incorporating all three sources of collateral information seemed to produce the largest effects for most comparisons. This differed from the finding with ELA, probably because previous-year school-level covariances were incorporated in the ELA models but not in the math models; consequently, the lower correlations for the 2006 sample did not affect the results of Model F for math. For school-level results, generally speaking, both comparisons yielded ratios greater than one for SNR and SSI and less than one for SEE, indicating enhanced estimation precision and increased profile variability resulting from using previous-year school-level sub-score information in both the univariate and multivariate models. The effects were comparable between the univariate and multivariate models, probably because the collateral information used in these two models was similar.

Tables 4.16 to 4.18 show the three comparison criteria by three school sizes: small, medium, and large. Similar to what was observed with ELA, as more collateral information is incorporated, small and medium schools tended to benefit more than large schools in terms of enhanced estimation precision. Regarding SSI, the results were not as consistent, but there was a general pattern that small and medium schools also tended to benefit more than large schools in terms of increased profile variability.

Table 4.1 Signal noise ratios (student and school levels) for ELA
Models / Sub 1 / Sub 2 / Sub 3
Student Level: A: Kelley; B: Kelley + School; C: Kelley + School + Prior; D: Shin; E: Shin + School; F: Shin + School + Prior
School Level: B: Kelley + School; C: Kelley + School + Prior; E: Shin + School; F: Shin + School + Prior

Table 4.2 Standard errors of estimate (student and school levels) for ELA
Models / Sub 1 / Sub 2 / Sub 3
Student Level: A: Kelley; B: Kelley + School; C: Kelley + School + Prior; D: Shin; E: Shin + School; F: Shin + School + Prior
School Level: B: Kelley + School; C: Kelley + School + Prior; E: Shin + School; F: Shin + School + Prior

Table 4.3 Sub-score separation indices (student and school levels) for ELA
Models / Sub 12 / Sub 23 / Sub 13
Student Level: A: Kelley; B: Kelley + School; C: Kelley + School + Prior; D: Shin; E: Shin + School; F: Shin + School + Prior
School Level: B: Kelley + School; C: Kelley + School + Prior; E: Shin + School; F: Shin + School + Prior

Figure 4.1 Student-level signal noise ratio for ELA
Figure 4.2 Student-level standard error of estimate for ELA

Figure 4.3 Student-level sub-score separation index for ELA
Figure 4.4 School-level signal noise ratio for ELA

Figure 4.5 School-level standard error of estimate for ELA
Figure 4.6 School-level sub-score separation index for ELA

Table 4.4 Comparison of criterion ratios across models for ELA: Signal noise ratios
Models / Collateral information / Sub 1 / Sub 2 / Sub 3
Student level: B vs A (School); C vs B (Prior school means/variances); C vs A (School + Prior school means/variances); D vs A (Other sub-scores); E vs D (School); F vs E (Prior school means and variances/covariances); F vs D (School + Prior school means and variances/covariances); E vs A (Other sub-scores + School); F vs A (Other sub-scores + School + Prior school means and variances/covariances)
School level: C vs B (Prior school means/variances); F vs E (Prior school means and variances/covariances)

Table 4.5 Comparison of criterion ratios across models for ELA: Standard errors of estimate
Models / Collateral information / Sub 1 / Sub 2 / Sub 3
Student level: B vs A (School); C vs B (Prior school means/variances); C vs A (School + Prior school means/variances); D vs A (Other sub-scores); E vs D (School); F vs E (Prior school means and variances/covariances); F vs D (School + Prior school means and variances/covariances); E vs A (Other sub-scores + School); F vs A (Other sub-scores + School + Prior school means and variances/covariances)
School level: C vs B (Prior school means/variances); F vs E (Prior school means and variances/covariances)

Table 4.6 Comparison of criterion ratios across models for ELA: Sub-score separation indices
Models / Collateral information / Sub 1 / Sub 2 / Sub 3
Student level: B vs A (School); C vs B (Prior school means/variances); C vs A (School + Prior school means/variances); D vs A (Other sub-scores); E vs D (School); F vs E (Prior school means and variances/covariances); F vs D (School + Prior school means and variances/covariances); E vs A (Other sub-scores + School); F vs A (Other sub-scores + School + Prior school means and variances/covariances)
School level: C vs B (Prior school means/variances); F vs E (Prior school means and variances/covariances)

Figure 4.7 Student-level signal noise ratio ratio for ELA
Figure 4.8 Student-level standard error of estimate ratio for ELA

Figure 4.9 Student-level sub-score separation index ratio for ELA

Table 4.7 Comparison of school-level signal noise ratios by school size for ELA
Models / School size / Sub 1 / Sub 2 / Sub 3
Rows: B: Kelley + School; C: Kelley + School + Prior; E: Shin + School; F: Shin + School + Prior (each by Small, Medium, Large)

Table 4.8 Comparison of school-level standard errors of estimate by school size for ELA
Models / School size / Sub 1 / Sub 2 / Sub 3
Rows: B: Kelley + School; C: Kelley + School + Prior; E: Shin + School; F: Shin + School + Prior (each by Small, Medium, Large)

Table 4.9 Comparison of school-level sub-score separation indices by school size for ELA
Models / School size / Sub 12 / Sub 23 / Sub 13
Rows: B: Kelley + School; C: Kelley + School + Prior; E: Shin + School; F: Shin + School + Prior (each by Small, Medium, Large)

Table 4.10 Signal noise ratios (student and school levels) for math
Models / Sub 1 / Sub 2 / Sub 3 / Sub 4 / Sub 5 / Sub 6
Student Level: A: Kelley; B: Kelley + School; C: Kelley + School + Prior; D: Shin; E: Shin + School; F: Shin + School + Prior
School Level: B: Kelley + School; C: Kelley + School + Prior; E: Shin + School; F: Shin + School + Prior

Table 4.11 Standard errors of estimate (student and school levels) for math
Models / Sub 1 / Sub 2 / Sub 3 / Sub 4 / Sub 5 / Sub 6
Student Level: A: Kelley; B: Kelley + School; C: Kelley + School + Prior; D: Shin; E: Shin + School; F: Shin + School + Prior
School Level: B: Kelley + School; C: Kelley + School + Prior; E: Shin + School; F: Shin + School + Prior

Table 4.12 Sub-score separation indices (student and school levels) for math
Models / Sub 12 / Sub 13 / Sub 14 / Sub 15 / Sub 16 / Sub 23
Student Level: A: Kelley; B: Kelley + School; C: Kelley + School + Prior; D: Shin; E: Shin + School; F: Shin + School + Prior
School Level: B: Kelley + School; C: Kelley + School + Prior; E: Shin + School; F: Shin + School + Prior
Models / Sub 24 / Sub 25 / Sub 26 / Sub 34 / Sub 35 / Sub 36
Student Level: A: Kelley; B: Kelley + School; C: Kelley + School + Prior; D: Shin; E: Shin + School; F: Shin + School + Prior
School Level: B: Kelley + School; C: Kelley + School + Prior; E: Shin + School; F: Shin + School + Prior

Table 4.12 Continued
Models / Sub 45 / Sub 46 / Sub 56
Student Level: A: Kelley; B: Kelley + School; C: Kelley + School + Prior; D: Shin; E: Shin + School; F: Shin + School + Prior
School Level: B: Kelley + School; C: Kelley + School + Prior; E: Shin + School; F: Shin + School + Prior

Figure 4.10 Student-level signal noise ratio for math
Figure 4.11 Student-level standard error of estimate for math

Figure 4.12 Student-level sub-score separation index for math
Figure 4.13 School-level signal noise ratio for math

Figure 4.14 School-level standard error of estimate for math
Figure 4.15 School-level sub-score separation index for math

Table 4.13 Comparison of criterion ratios across models for math: Signal noise ratios
Models / Collateral information / Sub 1 / Sub 2 / Sub 3 / Sub 4 / Sub 5 / Sub 6
Student level: B vs A (School); C vs B (Prior school means/variances); C vs A (School + Prior school means/variances); D vs A (Other sub-scores); E vs D (School); F vs E (Prior school means/variances); F vs D (School + Prior school means/variances); E vs A (Other sub-scores + School); F vs A (Other sub-scores + School + Prior school means/variances)
School level: C vs B (Prior school means/variances); F vs E (Prior school means/variances)

Table 4.14 Comparison of criterion ratios across models for math: Standard errors of estimate
Models / Collateral information / Sub 1 / Sub 2 / Sub 3 / Sub 4 / Sub 5 / Sub 6
Student level: B vs A (School); C vs B (Prior school means/variances); C vs A (School + Prior school means/variances); D vs A (Other sub-scores); E vs D (School); F vs E (Prior school means/variances); F vs D (School + Prior school means/variances); E vs A (Other sub-scores + School); F vs A (Other sub-scores + School + Prior school means/variances)
School level: C vs B (Prior school means/variances); F vs E (Prior school means/variances)

Table 4.15 Comparison of criterion ratios across models for math: Sub-score separation indices
Models / Collateral information / Sub 12 / Sub 13 / Sub 14 / Sub 15 / Sub 16 / Sub 23 / Sub 24 / Sub 25
Student level: B vs A (School); C vs B (Prior school means/variances); C vs A (School + Prior school means/variances); D vs A (Other sub-scores); E vs D (School) -- INF in these columns; F vs E (Prior school means/variances); F vs D (School + Prior school means/variances) -- INF in these columns; E vs A (Other sub-scores + School); F vs A (Other sub-scores + School + Prior school means/variances)
School level: C vs B (Prior school means/variances); F vs E (Prior school means/variances)

Table 4.15 Continued
Models / Collateral information / Sub 26 / Sub 34 / Sub 35 / Sub 36 / Sub 45 / Sub 46 / Sub 56
Student level: B vs A (School); C vs B (Prior school means/variances); C vs A (School + Prior school means/variances); D vs A (Other sub-scores); E vs D (School) -- INF in these columns; F vs E (Prior school means/variances) -- INF in some columns; F vs D (School + Prior school means/variances) -- INF in these columns; E vs A (Other sub-scores + School); F vs A (Other sub-scores + School + Prior school means/variances)
School level: C vs B (Prior school means/variances); F vs E (Prior school means/variances) -- INF in one column
Note: INF refers to infinity, which results from dividing a number by a real zero.

Figure 4.16 Student-level signal noise ratio ratio for math
Figure 4.17 Student-level standard error of estimate ratio for math

Figure 4.18 Student-level sub-score separation index ratio for math

Table 4.16 Comparison of school-level signal noise ratios by school size for math
Models / School size / Sub 1 / Sub 2 / Sub 3 / Sub 4 / Sub 5 / Sub 6
Rows: B: Kelley + School; C: Kelley + School + Prior; E: Shin + School; F: Shin + School + Prior (each by Small, Medium, Large)

Table 4.17 Comparison of school-level standard errors of estimate by school size for math
Models / School size / Sub 1 / Sub 2 / Sub 3 / Sub 4 / Sub 5 / Sub 6
Rows: B: Kelley + School; C: Kelley + School + Prior; E: Shin + School; F: Shin + School + Prior (each by Small, Medium, Large)

Table 4.18 Comparison of school-level sub-score separation indices by school size for math
Models / School size / Sub 12 / Sub 13 / Sub 14 / Sub 15 / Sub 16 / Sub 23
Rows: B: Kelley + School; C: Kelley + School + Prior; E: Shin + School; F: Shin + School + Prior (each by Small, Medium, Large)
Models / School size / Sub 24 / Sub 25 / Sub 26 / Sub 34 / Sub 35 / Sub 36
Rows: B: Kelley + School; C: Kelley + School + Prior; E: Shin + School; F: Shin + School + Prior (each by Small, Medium, Large)

Table 4.18 Continued
Models / School size / Sub 45 / Sub 46 / Sub 56
Rows: B: Kelley + School; C: Kelley + School + Prior; E: Shin + School; F: Shin + School + Prior (each by Small, Medium, Large)

CHAPTER V. DISCUSSION AND CONCLUSION

Educators and administrators routinely use state accountability assessment results to provide diagnostic information to inform instruction and curriculum planning. With the introduction of No Child Left Behind, this practice has been reinforced by the requirement that all state accountability assessments produce individual student interpretive, descriptive, and diagnostic reports. This mandate creates the need to report sub-scores in a meaningful manner. Researchers have repeatedly cautioned score users about several psychometric limitations of observed sub-scores, two of which were the focus of the present study. One is related to sub-score reliability: sub-scores may have limited reliabilities due to short lengths (Edwards and Vevea, 2006; Goodman and Hambleton, 2004; Haberman, 2005; Haberman, Sinharay, and Puhan, 2006; Monaghan, 2006; Shin, 2004; Wainer, Sheehan, and Wang, 2000; Wainer et al., 2001; Yen, 1987; Yen, 1997; Yao and Boughton, 2007). The other concerns sub-score specificity: for many existing assessments, there may be little distinct information in sub-scores that is not reflected in total scores (Haberman, 2005; Haberman et al., 2006; Monaghan, 2006; Sinharay, Haberman, and Puhan, 2007).

In response to these issues related to the psychometric properties of sub-scores, the present study examined the effects of incorporating different levels of collateral information on sub-score estimation. The three sources of collateral information under investigation included: (1) information from other sub-scores, (2) schools that students attended, and (3) school-level scores on the same test taken by previous cohorts of students in that school. To examine the incremental effects of the collateral information, six Bayesian models were constructed: (1) Model A: the fully Bayesian Kelley's regressed score model, a non-augmented model that estimates sub-scores based on items only within the strand of interest; (2) Model B: Model A plus

schools attended; (3) Model C: Model A plus both schools attended and previous-year school-level sub-score information; (4) Model D: Shin's (2004) multivariate fully Bayesian model, which incorporates information from other strands; (5) Model E: Model D plus schools attended; and (6) Model F: Model D plus both schools attended and previous-year school-level sub-score information. Analyses were conducted on the standardized observed score metric. Results were evaluated in light of three comparison criteria, i.e., signal noise ratio (SNR), standard error of estimate (SEE), and sub-score separation index (SSI). Data came from a state accountability assessment program and consisted of two subject areas: English language arts (ELA) and math. These two subject areas were chosen because they differ considerably in test design characteristics and each represents a typical type of assessment currently in use.

In Section 5.1, each of the research questions laid out in Chapter III is revisited and discussed using the results obtained in Chapter IV. Section 5.2 centers on a discussion of the issues related to sub-scores laid out in Chapter I as well as at the beginning of this chapter. Section 5.3 focuses on how incorporation of collateral information affects the validity and interpretability of the resulting scores, as well as when such scores are suitable and when they are not. Finally, limitations of the study are acknowledged and future research is suggested.

5.1 Discussion of Research Questions

In this section, the major results and findings of this study are summarized and discussed in light of the four research questions. Before proceeding to a detailed discussion, it is helpful to briefly review the models and the mechanism through which incorporation of each source of collateral information would affect the estimation of sub-scores at both the student level and the school level.

Shin's (2004) model (Model D) is the baseline multivariate model, which incorporates information from other sub-scores by modeling the correlations among sub-scores. Instead of regressing toward the overall group mean as in Kelley's model (Model A), augmented sub-score estimates in Shin's model are also affected by information from other sub-scores. Such a use of collateral information makes two things happen: (1) the variance of the posterior means, Var[E(τ̂ | x)], increases; and (2) the conditional posterior variance, Var(τ̂ | x), decreases. This prediction is consistent with what was found in Wainer et al.'s (2001) study. As such, it is reasonable to anticipate that using such collateral information will enhance estimation precision (i.e., increase SNR and decrease SEE). For SSI, however, using information from other sub-scores may produce two opposite effects. One is to reduce profile variability as a result of sub-scores borrowing information from one another; the other is to reduce posterior error variance as a result of incorporating more information. Therefore, the change in SSI depends on the dynamic interplay of these two opposing forces. These predictions regarding the effects of using information from other sub-scores are also applicable to school-level results when the multivariate hierarchical models (Models E and F) are compared to the univariate hierarchical models (Models B and C). Only two important differences may impact the findings. One is that school mean scores rather than student scores are augmented, and the former are more reliable than the latter. The other is that between-school correlations are used for augmentation, and these may differ from correlations at the student level.
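The contrast between the two baseline models can be made concrete with a standard normal-theory identity (offered here for intuition; it is a simplification of the estimators developed in Chapters II and III). Kelley's regressed estimate for a single strand is

\[
\hat{\tau} = \rho\, x + (1-\rho)\,\mu,
\]

with reliability \(\rho\) and group mean \(\mu\), so an unreliable sub-score is simply pulled toward the mean. In a multivariate normal formulation of the kind Shin's model implements, the analogous posterior mean is

\[
E(\boldsymbol{\tau}\mid \mathbf{x}) = \boldsymbol{\mu} + \boldsymbol{\Sigma}_{\tau}\,(\boldsymbol{\Sigma}_{\tau} + \boldsymbol{\Sigma}_{E})^{-1}(\mathbf{x}-\boldsymbol{\mu}),
\]

where \(\boldsymbol{\Sigma}_{\tau}\) and \(\boldsymbol{\Sigma}_{E}\) are the true-score and error covariance matrices. The off-diagonal elements of \(\boldsymbol{\Sigma}_{\tau}\) are what allow one strand to borrow strength from the others; in the univariate case the expression collapses back to Kelley's formula with \(\rho = \sigma^2_{\tau}/(\sigma^2_{\tau}+\sigma^2_{E})\).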

Models B and E are hierarchical models with schools attended incorporated as additional collateral information. Model B is an adaptation of Kelley's model, in which a student's sub-score is assumed to follow a normal distribution whose mean and variance are the school's mean and variance. Model E is an adaptation of Shin's model, in which a student's sub-scores are assumed to follow a multivariate normal distribution whose means and variances/covariances are the school's means and variances/covariances. To the extent that schools vary, Models B and E produce different estimates from their corresponding non-hierarchical counterparts. These estimates have the potential to be more precise because the school-level model serves as an informative prior for the student-level model (Gelman and Hill, 2007), which in turn reduces the variance of the posterior distribution. Adding schools as a hierarchy in Shin's (2004) model has an additional effect: it allows schools to have their own correlations. This injects an unpredictable element into the augmentation algorithm, because the school-specific correlations can be higher or lower than the overall state correlations. The change in correlations triggers a change in the amount of information that sub-scores can borrow from one another, which in turn affects the estimation precision of sub-scores. Consequently, the effects of adding a school level in Shin's model are a composite of two different effects: (1) using the school-level model as the informative prior increases estimation precision, and (2) using school-specific correlations has unpredictable effects depending on how they compare with the overall state correlations. For SSI, the effects of using schools as collateral information depend on the nature of such information. School sub-scores are used as an informative prior for estimating student sub-scores; therefore, it is reasonable to expect sub-score differences to be reinforced if they are consistent with school-level sub-score differences, but weakened if they are not. Models C and F are adaptations of Models B and E that incorporate previous-year school-level sub-score information as additional collateral information. Model C incorporates previous-year school-level means and variances; Model F additionally incorporates previous-year school-level covariances. It is important to understand that such collateral information affects school- and student-level results in different ways. Because previous-year school-level information is implemented as an informative prior for the school-level model, it has a direct impact on school-level parameter estimation. Specifically, using such an informative prior helps reduce the posterior variance and thus improves estimation precision. Effects on SSI depend on the nature of the prior information. A school's sub-score profile is reflective of the school's curricular strengths

and weaknesses. If a school's curriculum is stable over time, its curricular profile is expected to be similar from year to year. In this scenario, the use of previous-year school performance will reinforce the existing strengths and weaknesses and thus enhance SSI. If there is a major change in the curriculum, however, the curricular profile is expected to change as well. In this scenario, the use of previous-year school performance will weaken the existing strengths and weaknesses and thus reduce SSI. For student-level results, incorporation of such collateral information proceeds in two steps. First, previous-year school-level sub-score information serves as a prior for the school-level model; it is combined with the current-year data to produce posterior distributions for school-level parameters. Second, the school-level parameter estimates thus obtained are used to estimate student-level parameters. For example, school means serve as informative priors for estimating student-level sub-scores, and school variances/covariances are used to determine how much information sub-scores can borrow from one another for students within that school. Therefore, incorporation of such collateral information impacts student-level results indirectly through school-level results, which makes its effects difficult to predict.

Discussion for Research Question 1

The first research question was: (1) What effects does the use of three sources of collateral information have on student-level sub-score estimation in terms of signal/noise ratio, standard error of estimate, and sub-score separation index? Do the findings differ for different tests (i.e., ELA and math)? The incremental effects of the three sources of collateral information were examined by comparing the six Bayesian models. In this section, the effects of each of them are summarized and discussed.

Effects of Using Information from Other Sub-scores

To investigate the effects of incorporating information from other sub-scores, Shin's (2004) model (Model D) was compared to Kelley's model (Model A). Results revealed a consistent trend: incorporation of such collateral information resulted in increased SNR as well as decreased SEE and SSI. This pattern suggested two findings: (1) enhanced estimation precision, and (2) reduced profile variability. The first finding was in agreement with previous research, i.e., estimating sub-scores using information from other portions of the test produced more precise results (e.g., Edwards and Vevea, 2006; Yao and Boughton, 2007; Wainer et al., 2000; Wainer et al., 2001; Zhu and Stone, 2008). The second finding addressed an issue that has been less well researched in the literature. Sub-score differences were reduced significantly after augmentation. The extent of the reduction depended on the reliabilities of the original sub-scores and their correlations. Specifically, lower reliabilities and higher correlations corresponded to greater reductions in SSI. This pattern is sensible: the smaller the subscales, the less unique information they carry, and the more highly correlated they are, the more information they borrow from one another. The effects of incorporating information from other sub-scores were consistent for both subject areas. The differences were only a matter of magnitude, and they could be accounted for in light of sub-score reliabilities and correlations. Recall that the tests for ELA and math differed substantially in terms of these two test design characteristics. Specifically, the ELA test consisted of three strands that had relatively high reliabilities by themselves, especially the first two strands (with reliabilities in the 0.80s), and their correlations were only moderate (with disattenuated correlations in the mid-0.70s and low 0.80s). The math test consisted of six strands that had quite limited reliabilities (mostly in the 0.50s and 0.60s), and their disattenuated correlations were mostly in the 0.90s. These differences led to augmentation of different strengths for the two tests. Only moderate augmentation occurred for the ELA test, as could be seen from the moderate SNR

ratios (i.e., less than 1.50 for the first two sub-scores and a little over 2.00 for the third sub-score), whereas substantial augmentation occurred for the math test, as could be seen from the high SNR ratios (i.e., mostly over 4.00). Sub-score differences for ELA were substantially reduced; for math, they were almost eliminated. The findings for the math test echoed the concerns voiced by many researchers in the literature, i.e., that augmented sub-scores obtained in this fashion were essentially replications of the total score and thus served little diagnostic purpose (e.g., Sinharay et al., 2007). Just as was suggested in the literature (Wainer et al., 2001; Haberman, 2007), augmentation using information from other sub-scores seemed to hold some promise for tests like ELA, in which sub-scores have relatively high reliabilities by themselves and are only moderately correlated. To recapitulate, using information from other sub-scores enhanced sub-score precision but reduced profile variability. Generally speaking, people tend to think of enhanced sub-score precision as a pro and reduced profile variability as a con. However, such an interpretation is subject to question. First, is the improvement in precision associated with augmented sub-scores warranted and valid? As was explained in Chapter II, sub-score augmentation works via a mechanism that allows sub-scores to borrow information from other portions of the test to the extent that the constructs they measure are similar. As is known, total true score variance consists of common variance and unique variance. Ideally, common variance is what the augmentation procedure can borrow to increase sub-score precision, and unique variance is what makes sub-scores different and thus should be preserved for diagnostic purposes. The reduced SSI found in the present study seemed to suggest that information was borrowed from unique variance as well, which might alert us to the possibility of a new problem: construct contamination. If such a possibility is proven true, then the validity of the resulting sub-scores may be called into question. Simulation studies are needed to investigate this issue further. Second, do the variable profiles associated with raw sub-scores render them

better diagnostic tools? Researchers have repeatedly cautioned sub-score users that when sub-scores are of limited reliability, using sub-score differences for diagnosis is almost equivalent to chasing errors of measurement (Wainer et al., 2001). To put it differently, these sub-score differences are so unreliable that they might disappear if the same students were tested again. Any remedial action based on such information could be a waste of resources. The profiles were much flatter after augmentation was implemented. However, it was unclear whether such a flattening effect was a result of removing random fluctuations due to measurement error, of eliminating some of the true sub-score differences, or of a combination of both. Simulation studies are needed to address this issue as well.

Effects of Using Schools as Collateral Information

Models B and A were compared to examine the effects of using schools as collateral information in a univariate framework. Models E and D were compared to examine the effects of using the same source of information in a multivariate framework. In the univariate framework, using schools as a hierarchy generally led to more estimation precision, although only to a small extent. As was explained previously, such a gain depends on school heterogeneity; the more heterogeneous the schools, the larger the gain. In the multivariate framework, however, using such collateral information produced different results for ELA and math. Generally speaking, it enhanced precision for ELA but reduced precision for math. Such differences arose because disaggregating the entire sample by schools had different impacts on school-level correlations for the two subject areas. Specifically, as was shown in Appendix C, sub-score correlations were lower at the school level than at the state level for math, but this pattern was not as manifest for ELA. As such, two opposing forces were in play for math. One was that adding schools as a hierarchy tended to increase precision; the other was that disaggregating the entire sample by schools lowered the correlations used in the multivariate model, and thus

reduced the gain in precision resulting from using information from other sub-scores. For the math test, the latter outweighed the former, resulting in a net reduction in precision. With respect to profile variability, the results were generally mixed for both the univariate and multivariate models, and this held true for both ELA and math. This finding was anticipated because school effects depend on the consistency between student and school profiles. When they were consistent, student profiles could be accentuated as a result of using school profiles as informative priors. When they were not consistent, student profiles could be weakened. Given the above-mentioned findings, how shall we evaluate the pros and cons of incorporating schools as collateral information from a psychometric perspective? For the univariate model, using such information led to enhanced estimation precision across the board, although only to a small extent. Such a gain from a psychometric perspective needs to be weighed together with other perspectives (e.g., practicality, fairness, and costs) to determine whether it is worthwhile to implement this model. For the multivariate model, such information resulted in enhanced precision for ELA but reduced precision for math. However, I would argue that these results lent even more support for implementing a hierarchical model for math when considered from a validity rather than a reliability perspective. Recall that a major motive for using a hierarchical model in the present study was to address Simpson's paradox (Wainer et al., 2000), which suggests that correlations for different subgroups may not be identical, and that using the same correlation matrix for the entire group may result in bias and thus threaten the validity of the resulting scores. This was evidently the case with the math test. Correlations among sub-scores appeared to vary considerably across schools. The almost perfect correlations at the state level did not hold when broken down to the school level. Consequently, using the state-level correlations to augment sub-scores for all students, as in the non-hierarchical models, seemed questionable in terms of validity. A sound practice in the future would be to check correlations for important sub-groups and see how invariant

they are; if they vary substantially, using one correlation matrix for all subgroups will not be warranted.

Effects of Using Previous Year School-Level Sub-score Information

Models C and B were compared to examine the effects of using previous-year school means and variances. Models F and E were compared to examine the effects of using previous-year school means and variances/covariances. For math, due to the mathematical constraints noted in Chapter IV, previous-year school covariances were not incorporated. When prior information was restricted to means and variances, as in Model C for both subjects and Model F for math, the effects were generally either very small or mixed in terms of both estimation precision and profile variability. The small magnitudes or mixed patterns were anticipated given that such collateral information was incorporated at the school level and affected student-level results only indirectly through school-level results. When previous-year covariances were incorporated, as in Model F for ELA, however, a consistent pattern was observed: estimation precision was reduced while sub-score profile variability was increased. This finding can be explained in light of the nature of the prior information incorporated. As Table 3.3 shows, the correlations were lower for the 2006 sample than for the 2007 sample. When the 2006 correlations were combined with the 2007 correlations in the Bayesian algorithm, lower correlations were used in the augmentation algorithm, and thus less gain in precision resulted from using information from other sub-scores. A common question concerning the use of previous-year school covariance information is whether such information is legitimate for use in estimating the current year's sub-scores. The answer is that it depends. If a school has recently experienced a substantial curricular change, the relationships among sub-scores from previous years may not be applicable to this year. However, if the school curriculum is stable, sub-score

correlations may hold steady across years. Sub-score users are advised to determine how much weight should be given to previous-year school covariance information depending on how stable the school curriculum is.

Discussion for Research Question 2

The second research question was: (2) For the four hierarchical models (i.e., Models B, C, E and F), what effects does the use of collateral information have on school-level sub-score estimation in terms of signal/noise ratio, standard error of estimate, and sub-score separation index? Do the findings differ for different tests (i.e., ELA and math)? At the school level, two sources of collateral information were investigated: the effects of using information from other sub-scores (Models B vs. E), and the effects of incorporating previous-year school-level sub-score information into the univariate and multivariate models, respectively (Models C vs. B, and Models F vs. E). In this section, the related findings are described and discussed. At the school level, the effects of using information from other sub-scores followed a pattern similar to what was observed with student-level results: enhanced estimation precision (larger SNR and smaller SEE) and reduced profile variability (smaller SSI). However, similar as the pattern was, the magnitudes of change were substantially smaller, suggesting that less augmentation occurred at the school level than at the student level. This finding was dictated by the theory that the effect of regressing toward the mean is a function of the error of measurement; the less error involved, the less regression occurs (Wainer et al., 2001). School means were more reliable than individual student scores, and thus less information was borrowed from other sub-scores. Another notable finding concerned the effects on profile variability. Recall that at the student level, the profile variability was substantially

reduced for ELA and almost eliminated for math. At the school level, however, such variability was preserved surprisingly well for both ELA and math. This, too, can be explained by the reason cited above: school means were more reliable, and thus less information from other sub-scores was borrowed. These findings held true for both subject tests, with differences only in magnitude that could be explained in light of sub-score reliabilities and/or correlations. For the univariate model, previous-year school means and variances were incorporated as priors for both subject tests. There was a consistent pattern that using such collateral information produced sub-scores with more estimation precision (i.e., higher SNR and lower SEE) and more variable profiles (i.e., higher SSI). This pattern held for both ELA and math. Notably, in most comparisons, using previous-year school information enhanced estimation precision to an even greater extent than using information from other sub-scores did, suggesting that the former was more effective than the latter in enhancing estimation precision at the school level, while the reverse was true at the student level. The increase in profile variability was a result of accumulating school profile information over years. The rationale can be explained by the following scenario. Suppose a school's curriculum is weak in a certain area, and one test does not provide sufficient information to reveal it. When this weakness is consistent over time, however, such profile information may accumulate through repeated testing and become manifest. Therefore, the effects of such collateral information on profile variability are not only governed by the amount of information incorporated but also influenced by the nature of the prior information. For the multivariate model, the results for both ELA and math yielded a pattern consistent with what was observed for the univariate model: enhanced estimation precision and increased profile variability. Given the aforementioned findings, how shall we evaluate the utility of using the two sources of collateral information at the school level? Comparatively speaking, using

previous-year school sub-score information seemed to have the potential to strengthen a curricular profile when it is consistent over time, whereas using information from other sub-scores appeared to reduce profile variability. This suggests that previous-year sub-score information holds more promise than within-year sub-score information when sub-scores are primarily used for diagnostic purposes. In terms of estimation precision, generally speaking, the two sources of collateral information produced quite comparable results, with within-year sub-score information slightly outperforming previous-year information. This superior performance renders information from other sub-scores the better option when the primary purpose of using sub-scores is to rank schools in different content areas. However, this slight benefit in estimation precision needs to be evaluated in conjunction with its drawbacks in other respects, such as potential construct contamination and reduced diagnostic effectiveness, as mentioned earlier in Section 5.1. Overall, at the school level, incorporating previous-year school information in a multi-level Kelley's model (Model C) seemed to produce reasonably satisfactory results in terms of enhancing sub-score reliability and preserving/accentuating curricular profiles, without creating potential problems such as construct contamination.

Discussion for Research Question 3

The third research question was: (3) For the four hierarchical models (i.e., Models B, C, E and F), how do sub-score estimates compare at the student and school levels in terms of the three criteria? Do the findings differ for different tests (i.e., ELA and math)? Student- and school-level results were compared for each of the four hierarchical models. It was found that across all four models, school-level results showed considerably higher estimation precision than student-level results. The gains were larger

when sub-scores were less reliable. Consequently, the differences were larger for math than for ELA because sub-scores for math tended to have lower reliabilities. With regard to the absolute magnitude of the precision-related criteria, although most student-level sub-scores did not show the precision required for reporting, school-level sub-scores exhibited very respectable precision. With regard to profile variability, generally speaking, larger SSI was observed at the school level than at the student level, although some exceptions occurred for ELA. It was reasonable to anticipate some mixed results because school-level SSI and student-level SSI refer to two different concepts. A school profile refers to the school's curricular profile, which reflects strengths and weaknesses associated with the school's curriculum. A student profile refers to the student's personal profile, which may reflect a combination of the school's curricular profile and his/her idiosyncratic pattern of strengths and weaknesses. There is no theory dictating the relative magnitude of the two. However, some general knowledge about school curricula may help formulate reasonable hypotheses. For example, it is reasonable to believe that students within a school may show profiles that are more similar than those of students from different schools. If this is the case, then it is possible that a profile of strengths and weaknesses is present but not strong enough to manifest itself at the student level due to a lack of estimation precision; however, if a sufficient number of students within a school show a similar profile, that profile will manifest itself as a curricular profile for the school. The high level of estimation precision associated with school-level results allows for detection of profiles that are hard to detect at the student level. This hypothesis is supported by the generally larger magnitude of SSI at the school level than at the student level. Several researchers who compared school- and student-level results (Haberman, Sinharay, and Puhan, 2006; Sinharay, Haberman, and Puhan, 2007) found little value in student-level sub-scores, but still recommended additional research on school-level sub-scores in the hope of finding some value in them. The findings in the present study undoubtedly

represent good news: school-level sub-scores showed not only satisfactory reliabilities but also satisfactory diagnostic value. Among the four models, the two Kelley's models seemed to hold the most promise because they could offer both reasonable estimation precision and diagnostic value without the potential drawback of construct contamination.

Discussion for Research Question 4

The fourth research question was: (4) Do the effects of using collateral information differ for schools of different sizes? School-level results for the four hierarchical models were disaggregated by school size: small, medium, and large. They were compared in light of the three comparison criteria. This was intended as an exploratory analysis to extract meaningful patterns. Generally speaking, across all models, larger school sizes corresponded to larger SNR and SSI, as well as smaller SEE. There appeared to be greater disparity between large and medium schools than between medium and small schools. As more collateral information was incorporated, small and medium schools seemed to garner larger increases in SNR and larger decreases in SEE than large schools. This finding held true for both subject tests. It was consistent with the general finding in the literature that the less reliable the sub-scores, the more strength they borrow from collateral information (Wainer et al., 2001). Because sub-scores tended to be less reliable for small and medium schools, those schools should benefit more from the use of collateral information. For profile variability, there seemed to be no consistent pattern for ELA; for math, however, small and medium schools seemed to garner larger increases in SSI than large schools.
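The school-size pattern has a simple analytic counterpart in the normal hierarchical model. In the basic two-level normal case (a sketch of the general result, not the exact posterior of Models B through F, which involve additional unknown variance parameters), the posterior mean of a school's true mean is a precision-weighted compromise between the school's observed mean and the state mean:

\[
E(\mu_j \mid \bar{x}_j) = w_j\,\bar{x}_j + (1-w_j)\,\mu,
\qquad
w_j = \frac{n_j/\sigma^2}{\,n_j/\sigma^2 + 1/\sigma^2_{\mu}\,},
\]

where \(n_j\) is the number of students in school \(j\), \(\sigma^2\) the within-school variance, and \(\sigma^2_{\mu}\) the between-school variance (Gelman and Hill, 2007, give this result). Because \(w_j\) grows with \(n_j\), small schools are shrunk more strongly toward the state mean and, equivalently, borrow more from whatever collateral information is supplied, which is exactly the pattern observed here.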

5.2 Discussion for Problems Related to Sub-scores

The research literature has pointed out several psychometric problems that observed sub-scores may have. The present study focused on addressing two of them. One was related to sub-score reliability; sub-scores might have limited reliabilities due to short lengths (Edwards and Vevea, 2006; Goodman and Hambleton, 2004; Haberman, 2005; Haberman, Sinharay, and Puhan, 2006; Monaghan, 2006; Shin, 2004; Wainer, Sheehan, and Wang, 2000; Wainer et al., 2001; Yen, 1987; Yen, 1997; Yao and Boughton, 2007). The other was concerned with sub-score specificity; for many existing assessments, there might be little distinct information in sub-scores that is not reflected in total scores (Haberman, 2005; Haberman et al., 2006; Monaghan, 2006; Sinharay, Haberman, and Puhan, 2007). This section is devoted to a discussion of how these two problems were addressed using the three sources of collateral information investigated in the study. The discussion is divided into two parts: student-level results and school-level results.

Student-Level Results

Using information from other sub-scores seemed to produce the largest gains in estimation precision. Sub-scores augmented in this fashion usually had sufficient reliabilities for reporting purposes. This was consistent with what was reported in the literature and provided a satisfactory solution to the first problem. However, these gains in precision came along with reduced sub-score profile variability, reflected in dramatic reductions of SSI after augmentation. It was unclear whether such reductions in profile variability resulted from borrowing unique variance from other sub-scores, which would represent a threat to validity as a result of construct contamination. Further investigation is warranted on this issue. As far as the second problem was concerned, using such collateral information did not help make sub-scores more distinct from one another. The reason is obvious: to make a sub-score more distinct from other

sub-scores, we need to increase the variance unique to that particular sub-score. What the current sub-score augmentation procedures do is borrow common variance from other sub-scores, and possibly some unique variance as well. Therefore, using information from other sub-scores for augmentation addressed the first problem well, but not the second. Adding schools as a hierarchy in Kelley's model seemed to produce consistent but small gains in estimation precision. The magnitude of the gains depends on school heterogeneity and may change upon replication with different data and different samples. The direction of the effects was positive, suggesting that using such collateral information might hold promise when the sample is heterogeneous and the gain in precision may be large enough to warrant the implementation of a hierarchical model. Adding schools as a hierarchy in Shin's (2004) model produced enhanced precision for ELA but reduced precision for math. The counter-intuitive finding for math was a result of reduced correlations after the entire sample was disaggregated by schools. This seeming loss in precision actually implied a gain in validity, a stronger argument for using a hierarchical model with schools as a level. Such a model can address Simpson's paradox (Wainer et al., 2001), which may threaten the validity of augmented sub-scores when correlations are not invariant across subgroups but an overall correlation matrix is used for all. To sum up, adding schools as a hierarchy seemed to have the potential to enhance precision to the extent that the sample is heterogeneous, and to improve validity to the extent that correlations are not invariant across subgroups. With regard to the second problem, using schools as collateral information seemed to hold some promise in theory because such collateral information could add to the unique variance of a particular sub-score. However, whether this added unique variance would make sub-scores more different or more similar for a particular student depends on whether the school's profile is consistent with the student's profile. If it is consistent, it will accentuate the student's profile; otherwise, it will weaken the student's profile. Therefore, in reality, mixed results should be

anticipated, and this was the case in the present study. However, a legitimate question is whether it is appropriate to use a school profile as prior information for a student profile as a means of adding unique variance. This question is similar in nature to the question concerning the appropriateness of using any multi-level model to estimate student scores. The only difference is whether group means or group profiles are used as prior information for student estimates. It all depends on whether we are willing to assume that students within a school belong to a common subpopulation. Incorporating previous-year school-level sub-score information produced mixed results for both estimation precision and profile variability, suggesting limited utility in addressing both problems. However, tracing the root of this finding reveals that such collateral information was incorporated at the school level and affected student-level results only indirectly through school-level results. It is hypothesized that both problems could be addressed if previous-year student-level sub-score information were used as a prior. First, previous-year student sub-score information would serve as an informative prior in estimating current-year student sub-scores. Under Bayesian theory, this reduces the variance of the posterior distribution as a result of incorporating informative prior information. Hence, the first problem could possibly be addressed. Second, previous-year student sub-score information would also add to the unique variances of sub-scores, thus enhancing the information related to the unique part of the construct that the particular sub-score is purported to measure. Similarly, when student profiles are consistent over time, which should generally be the case, profiles will be accentuated with the accumulative use of test information; otherwise, they will be weakened. Matched longitudinal data are required to obtain previous student-level data. As was discussed in Section 1.4, content differences across grades need to be considered when such information is incorporated. Therefore, future research is needed to investigate whether, and the extent to which, incorporating student-level prior information will increase both the precision and the distinct information of sub-scores.
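The precision half of this hypothesis follows directly from conjugate normal updating. In the simplest case of a normal prior and a normal likelihood (a sketch that ignores the hierarchical structure of the actual models), posterior precision is the sum of prior and data precisions:

\[
\frac{1}{\operatorname{Var}(\tau \mid x)} = \frac{1}{\sigma^2_0} + \frac{1}{\sigma^2_E},
\qquad
E(\tau \mid x) = \operatorname{Var}(\tau \mid x)\left(\frac{\mu_0}{\sigma^2_0} + \frac{x}{\sigma^2_E}\right),
\]

so any informative prior (finite \(\sigma^2_0\)), whether built from school membership or from a student's own previous-year score, necessarily tightens the posterior. What it does to profile distinctness, by contrast, depends on how the prior means \(\mu_0\) vary across strands, which is why the effects on SSI cannot be predicted from precision considerations alone.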

School-Level Results

At the school level, using information from other sub-scores seemed to produce some gains in precision, but the extent of the gains was modest at best. There also seemed to be some reduction in profile variability, although only to a small extent. As such, a similar conclusion can be drawn: using such information at the school level addressed the first problem but not the second. Using previous-year school-level means and variances produced gains in both precision and profile variability, suggesting that such information could be used to address both the first and the second problems. However, incorporating previous-year school-level covariances into the multivariate model added an extra layer of complexity; the nature of the covariance information incorporated also played a role in determining estimation precision. Incorporation of previous-year school covariances reduced estimation precision because the correlations were lower for the 2006 sample than for the 2007 sample. The appropriateness of incorporating such information depends on whether any change in sub-score correlations would be anticipated from year to year (e.g., as a result of curricular changes). It should also be noted that even the baseline hierarchical model (i.e., Model B, the hierarchical Kelley's model) produced fairly satisfactory results in terms of both precision and profile variability; thus, the two problems seemed to be automatically addressed when school-level results were of interest.

5.3 Validity, Interpretability and Suitability of Augmented Sub-scores

State accountability assessments are not primarily built to diagnose; however, the NCLB legislation requires them to provide diagnostic information as well. As such, the problems associated with sub-scores may mainly be a reflection of certain practices in the test development stage. For example, an important criterion for evaluating the quality of an item is its biserial correlation with the rest of the test; an item with a low biserial

correlation is less likely to be retained in the test. Similarly, when a unidimensional IRT model is used to construct a test, items that do not conform to the model are usually considered poor-fitting and thus are more likely to be removed from the test. These practices generally lead to the building of tests that are unidimensional in nature. Wainer et al. (2001) suggested that to provide diagnostic information, a sensible route is to build a multidimensional test from the ground up, including items with high correlations with other items within the strand but low correlations with items in other strands. These are implications for future test development if the goal is to provide diagnostic information. However, the task that test developers and score users currently face is to squeeze diagnostic information out of an existing accountability assessment, which often turns out to be less multidimensional than desired. The nature of the task can be described as follows: How much water can we squeeze out of a stone, even if it is a wet stone? The main purpose of the preceding discussion is to acknowledge the psychometric limitations inherent in sub-scores based on assessments that are not designed to provide diagnostic information. Such psychometric limitations may be reduced to some extent through statistical procedures (i.e., augmentation) as proposed in the literature or in the present study. However, these procedures may affect other aspects of the resulting scores, such as validity, interpretability, and suitability. The primary goal of augmenting sub-scores is to improve reliability by using collateral information. Consequently, at least two effects may occur: (1) error variance may be reduced; and (2) bias may be introduced. The former may lead to improved reliability, whereas the latter may undermine validity. The use of collateral information is valid only to the extent that its relationship with the sub-score of interest is true and invariant for all the individuals in the group to which it applies. To the extent that this assumption is violated, the validity of augmented sub-scores will be compromised. In reality, however, such violations are almost unavoidable. For example, when information

from other sub-scores is borrowed, even in a hierarchical model, all individuals in a group are assumed to share the same correlations. However, an individual can belong to an unlimited number of groups, and all of these group memberships have the potential to affect the correlations. In this sense, each individual is his or her own group and should have his or her own correlations. However, this is not statistically possible. Therefore, an attempt to represent an individual's correlations with group-level correlations may introduce invalidity into augmented sub-scores. Another example concerns the use of school membership and school past performance. Using such information is valid only when it is appropriate to associate a certain individual with a certain school. This is clearly not the case when an individual has only recently moved into a school. How shall we gauge the impact of augmentation in terms of improvement in reliability and potential loss in validity? It is important to emphasize that reliability is a prerequisite of validity (Gay, 1987). Therefore, these two opposite effects of augmentation on validity strongly call for validity studies to be conducted on a case-by-case basis. The use of collateral information renders the interpretation of the resulting sub-scores less straightforward. For example, when reading sub-scores are computed based only on the items designed to measure the reading construct, we can say that the resulting sub-scores can be interpreted as the level of reading skill that individuals possess, although our confidence in the accuracy of such a representation depends on the amount of measurement error involved. However, when other collateral information is incorporated, the resulting sub-scores become less interpretable solely from a construct perspective, because collateral information does not directly measure the construct of interest. How shall augmented sub-scores be interpreted? Wainer et al. (2001) suggest that augmented sub-scores can best be viewed as true score estimates. Kelley's regressed score estimate can be viewed as the simplest form of true sub-score estimate, approximating true sub-scores by using group means as collateral information. Augmented sub-scores can also be viewed as true sub-score estimates, and the only

difference is that other types of collateral information are used to better approximate true sub-scores. Therefore, augmented reading sub-scores can still be interpreted as measures of reading skill to the extent that the collateral information incorporated is valid. When are augmented sub-scores suitable for use? Augmented sub-scores have been found to be closer to true scores (Edwards and Vevea, 2006; Shin, 2004; Wainer et al., 2000; Wainer et al., 2001), which provides strong psychometric grounds for their use. In practice, however, perspectives other than psychometrics should also be taken into account, such as fairness and equity, convenience of implementation, and ease of interpretation. Here, the suitability of using augmented sub-scores is discussed mainly from the perspectives of psychometrics and fairness/equity. Wainer et al. (2001) provide an elegant treatment of this topic. The issue of suitability depends on the test purposes. According to Wainer et al. (2001), there are in general three purposes for a test: measurement, contest, and prod. Placement and diagnostic tests are principally for measurement but also serve as prods. Admissions tests and tests used to award scholarships are principally contests and also serve as prods. Wainer et al. (2001) suggested that when the test is principally a contest, issues of fairness and equity may arise when the collateral information incorporated is outside the control of the examinee, such as school membership and school past performance. Evidently, when two examinees have identical scores, it would be unfair to admit the one from a higher-performing school, even though his/her true score may be higher than that of the rival from a lower-performing school. However, when the test is principally for measurement, it is scored better using collateral information because the goal is to produce scores with the smallest possible error. This is especially true of diagnostic tests because it is assumed that it is to everyone's advantage to have the most accurate scores and thus the most accurate diagnosis of strengths and weaknesses. If, for some reason, such an assumption does not hold, issues of fairness and equity may arise as well when collateral information is outside the examinee's control. In the context of NCLB, remedial programs are usually

implemented either at the district level or at the school level. Therefore, if collateral information is incorporated from the same level at which remedial programs are implemented, fairness and equity become less of an issue, because students from the same school or district share the same collateral information. The discussion above is mainly from the perspective of score reporting. However, an equally important perspective is that of score use by end score users, such as teachers, students, administrators, policy-makers, and admissions officers. They use scores to help their decision-making. It is recommended that decisions not be based on a single test score. It is common for decision-makers to use test scores in conjunction with other factors to arrive at a conclusion. When viewed from this broad perspective, the appropriateness of the Bayesian approach to sub-score augmentation may not be such a controversial issue. When end score users use sub-scores, they combine scores heuristically with other collateral information in their decision-making process. The difference with the Bayesian approach is that it combines such information in a more formal (albeit mathematical) manner. It should be recognized that there are many alternative types of collateral information that could be used other than what was reported in the study. It is the users' responsibility to judge what collateral information is appropriate to use in their own circumstances, be it student past test performance, grade-point average, courses taken, or more sensitive information such as ethnicity, gender, and schools. Outside the scope of score reporting, the Bayesian approach may hold even more promise in helping end score users with their decision-making process.

5.4 Limitations and Future Research

This study has a number of limitations. First, it employs data from only one state accountability assessment. Although two subject tests with very different design characteristics were chosen, it is possible that using other tests might lead to somewhat different results and conclusions. Therefore, it would be important to

investigate the use of collateral information using data from different assessment programs. Second, this study involves data from one state only. The effects of collateral information depend heavily on the nature of the data. For example, the effects of using school membership depend on how schools differ in their means and correlations; the more heterogeneous the schools, the larger the effects. Also, the effects of using school past performance depend on the school curriculum and its stability over time. Replications with different samples might yield different findings. Therefore, future research is needed to extend this study to samples with different characteristics. Third, this study also suffers limitations due to the use of real data. For example, sub-score profile variability was found to be greatly reduced as a result of using information from other sub-scores. However, it was unclear whether such a reduction should be attributed to the elimination of random fluctuations due to measurement error, of true sub-score differences, or of a combination of both. Simulation studies are needed to address this issue. Fourth, this study was conducted only on the standardized score metric. This metric is not free of problems. One problem is that all strands are standardized to have a mean of zero and a standard deviation of one. Such constraints may be unreasonable when strands vary in difficulty and spread. Therefore, future research is strongly called for to use different methodologies to estimate sub-scores that may overcome the aforementioned limitations. IRT, particularly multidimensional IRT, may hold strong appeal. Fifth, this study investigated only three sources of collateral information. There are many other possibilities yet to be explored. For example, based on what this study revealed about the effects of using school past performance, it is strongly suspected that previous-year sub-score information for the same students may help improve student-level sub-score estimation in terms of both precision and profile

variability. Therefore, future research is strongly encouraged to use matched longitudinal data to investigate the effects of using students' previous performance on student-level sub-score estimation. Other useful collateral information that warrants future research includes more detailed information at both the student and school levels, such as student course work, student course grades, and school characteristics.

REFERENCES

Adams, R. J., Wilson, M. R., & Wu, M. L. (1997). Multilevel item response modelling: An approach to errors in variables regression. Journal of Educational and Behavioral Statistics, 22.

Beaton, A., & Zwick, R. (1992). Overview for the National Assessment of Educational Progress. Journal of Educational Statistics, 17.

Bock, R. D., Thissen, D., & Zimowski, M. F. (1997). IRT estimation of domain scores. Journal of Educational Measurement, 34.

Bolstad, W. (2004). Introduction to Bayesian statistics. Hoboken, NJ: Wiley.

Brennan, R. L. (2001). Generalizability theory. New York: Springer-Verlag.

Carlin, B. P., & Louis, T. (2000). Empirical Bayes: Past, present and future. Journal of the American Statistical Association, 95(452).

Cowles, M. K., & Carlin, B. P. (1996). Markov chain Monte Carlo convergence diagnostics: A comparative review. Journal of the American Statistical Association, 91.

Cronbach, L. J., & Gleser, G. C. (1964). The signal/noise ratio in the comparison of reliability coefficients. Educational and Psychological Measurement, 24(3).

Dwyer, A., Boughton, K. A., Yao, L., Lewis, D., & Steffen, M. (2006). A comparison of subscore augmentation methods using empirical data. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco.

Eberly, L. E., & Carlin, B. P. (2000). Identifiability and convergence issues for Markov chain Monte Carlo fitting of spatial models. Statistics in Medicine, 19.

Edwards, M. C., & Vevea, J. (2006). An empirical Bayes approach to subscore augmentation: How much strength can we borrow? Journal of Educational and Behavioral Statistics, 31(3).

Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37.

Fischer, G. H., & Formann, A. K. (1982). Some applications of logistic latent trait models with linear constraints on the parameters. Applied Psychological Measurement, 6(4).

Gay, L. R. (1987). Selection of measurement instruments. In Educational research: Competencies for analysis and application (3rd ed.). New York: Macmillan.

Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7.

Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2003). Bayesian data analysis (2nd ed.). New York: Chapman and Hall.

Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge: Cambridge University Press.

Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (1995). Introducing Markov chain Monte Carlo. In W. R. Gilks, S. Richardson, & D. J. Spiegelhalter (Eds.), Markov chain Monte Carlo in practice (pp. 1-20). New York: Chapman and Hall.

Goodman, D. P., & Hambleton, R. K. (2004). Student test score reports and interpretive guides: Review of current practices and suggestions for future research. Applied Measurement in Education, 17(2).

Haberman, S. J. (2005). When can sub-scores have value? (ETS RR-05-08). Princeton, NJ: Educational Testing Service.

Haberman, S. J., Sinharay, S., & Puhan, G. (2006). Sub-scores for institutions (ETS RR). Princeton, NJ: Educational Testing Service.

Hall, E. (2007). Using collateral item and examinee information to improve IRT item parameter estimation. Ph.D. dissertation, The University of Iowa, United States -- Iowa. Retrieved November 5, 2008, from Dissertations & University of Iowa database. (Publication No. AAT)

Kelley, T. L. (1927). Interpretation of educational measurements. New York: Macmillan.

Kim, J. S., & Bolt, D. M. (2007). Estimating item response theory models using Markov chain Monte Carlo methods. Educational Measurement: Issues and Practice, 26(4).

Lee, P. M. (1989). Bayesian statistics: An introduction (2nd ed.). New York: Arnold.

Ligon, G. D. (2007). Why Eva Baker doesn't seem to understand accountability: The politimetrics of accountability. Third Education Group Review / Articles, 3(1).

Lindley, D. V. (1971). Bayesian statistics: A review. Philadelphia, PA: SIAM.

Mann, H. M., & Moulder, B. (2008). Minimizing bias in diagnostic feedback. Paper presented at the annual meeting of the National Council on Educational Measurement, New York City, NY.

Mislevy, R. J. (1993). Some formulas for use with Bayesian ability estimates (ETS RR-93-3). Princeton, NJ: Educational Testing Service.

Mislevy, R. J. (1987). Exploiting auxiliary information about examinees in the estimation of item parameters. Applied Psychological Measurement, 11(1).

Mislevy, R. J. (1988). Exploiting auxiliary information about items in the estimation of Rasch item difficulty parameters. Applied Psychological Measurement, 12(3).

Mislevy, R. J., Beaton, A., Kaplan, B., & Sheehan, K. (1992). Estimating population characteristics from sparse matrix samples of item responses. Journal of Educational Measurement, 29.

Mislevy, R., & Sheehan, K. (1989). The role of collateral information about examinees in item parameter estimation. Psychometrika, 54(4).

Monaghan, W. (2006). The facts about subscores. R&D Connections. Princeton, NJ: Educational Testing Service.

Muthen, B. O. (1989). Latent variable modeling in heterogeneous populations. Psychometrika, 54(4).

Rannala, B. (2002). Identifiability of parameters in MCMC Bayesian inference of phylogeny. Systematic Biology, 51(5).

Shin, D. (2004). A comparison of methods of estimating objective scores. Ph.D. dissertation, The University of Iowa, United States -- Iowa. Retrieved December 10, 2007, from ProQuest Digital Dissertations database. (Publication No. AAT)

Shin, D. (2008). Using Bayesian sequential analyses in evaluating the prior effect for the estimation of subscale scores. Paper presented at the annual meeting of the National Council on Measurement in Education, New York City.

Sinharay, S., Haberman, S., & Puhan, G. (2007). Sub-scores based on classical test theory: To report or not to report. Educational Measurement: Issues and Practice, 26(4), 21-28.

Spiegelhalter, D. J., Thomas, A., Best, N. G., & Gilks, W. R. (1996). BUGS: Bayesian inference using Gibbs sampling, Version 0.5 (version ii). Cambridge: Medical Research Council Biostatistics Unit.

Spiegelhalter, D. J., Thomas, A., Best, N. G., & Lunn, D. (2004). WinBUGS version 1.4.1: User manual. Cambridge: Medical Research Council Biostatistics Unit.

Suppes, P., Jerman, M., & Brian, D. (1968). Computer-assisted instruction: Stanford's arithmetic program. New York, London: Academic Press.

Tate, R. L. (2004). Implications of multidimensionality for total score and sub-score performance. Applied Measurement in Education, 17(2).

Thissen, D., & Edwards, M. C. (2005). Diagnostic scores augmented using multidimensional item response theory: Preliminary investigation of MCMC strategies. Paper presented at the annual meeting of the National Council on Educational Measurement, Montreal, Canada.

Thulasiraman, K., & Swamy, M. N. S. (1992). Graphs: Theory and algorithms. Malden: Wiley-Interscience.

Wainer, H., Sheehan, K., & Wang, X. (2000). Some paths toward making Praxis scores more useful. Journal of Educational Measurement, 37.

Wainer, H., Vevea, J. L., Camacho, F., Reeve, B. B., Rosa, K., Nelson, L., et al. (2001). Augmented scores: Borrowing strength to compute scores based on small numbers of items. In D. Thissen & H. Wainer (Eds.), Test scoring. Mahwah, NJ: Lawrence Erlbaum.

Walker, C. M., & Beretvas, S. N. (2003). Comparing multidimensional and unidimensional proficiency classifications: Multidimensional IRT as a diagnostic aid. Journal of Educational Measurement, 40.

Wang, W., Chen, P., & Cheng, Y. (2004). Improving estimation precision of test batteries using multidimensional item response models. Psychological Methods, 9.

Xie, Y., & Carlin, B. P. (2006). Measures of Bayesian learning and identifiability in hierarchical models. Journal of Statistical Planning and Inference, 136(10).

Yao, L., & Boughton, K. A. (2007). A multidimensional item response modeling approach for improving subscale proficiency estimation and classification. Applied Psychological Measurement, 31.

Yen, W. M. (1987, June). A Bayesian/IRT index of objective performance. Paper presented at the annual meeting of the Psychometric Society, Montreal, Quebec, Canada.

Yen, W. M., Sykes, R. C., Ito, K., & Julian, M. (1997). A Bayesian/IRT index of objective performance for tests with mixed item types. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago.

Zhu, X., & Stone, C. A. (2008). An evaluation of different approaches to sub-score augmentation for the Multistate Bar Examination. Paper presented at the annual meeting of the National Council on Educational Measurement, New York City, NY.

APPENDIX A. WINBUGS CODES

## Kelley's model
model {
  for (j in 1:3) {
    for (i in 1:N) {
      x[i,j] ~ dnorm(tau[i,j], precx[j])    # observed sub-score for strand j
      tau[i,j] ~ dnorm(mu[j], prectau[j])   # true sub-score regressed toward the group mean
    }
  }
  precx[1] ~ dgamma(6410, 1000)
  precx[2] ~ dgamma(7520, 1000)
  precx[3] ~ dgamma(2460, 1000)
  for (j in 1:3) {
    prectau[j] ~ dgamma(.001, .001)
    mu[j] ~ dnorm(mu0[j], precmu[j])
    sigmae[j] <- 1/precx[j]
    sigmatau[j] <- 1/prectau[j]
  }
}

list(N=5838, mu0 = c(0,0,0), precmu = c(1.0E-6,1.0E-6,1.0E-6))

#insert data here
x[,1] x[,2] x[,3]
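A note on the measurement-error priors in this and the following listings (an editorial gloss, not part of the original code): a precision prec ~ dgamma(a, b) has prior mean a/b, so these priors pin each strand's error variance near a known value on the standardized metric. The middle digit of the second and third shape parameters appears to have been lost in transcription; the values 7520 and 2460 used here are reconstructions, chosen because they make all three priors agree with the true-score variances (0.84, 0.87, 0.59) on the diagonal of the matrix R supplied to Shin's model below:

# E(precx[1]) = 6410/1000 = 6.41 -> error variance 0.156, true-score variance 0.844 (cf. 0.84)
# E(precx[2]) = 7520/1000 = 7.52 -> error variance 0.133, true-score variance 0.867 (cf. 0.87)
# E(precx[3]) = 2460/1000 = 2.46 -> error variance 0.407, true-score variance 0.593 (cf. 0.59)
# The large shape and rate values make these priors strongly informative:
# the error variances are treated as (nearly) known quantities.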

## The hierarchical Kelley's model
model {
   for (i in 1:N) {
      for (j in 1:3) {
         x[i,j] ~ dnorm(tau.stud[i,j], precx[j])
         # student true scores are centered on their school's true score
         tau.stud[i,j] ~ dnorm(tau.schl[schl[i],j], prectau[schl[i],j])
      }
   }
   precx[1] ~ dgamma(6410, 1000)
   precx[2] ~ dgamma(750, 1000)
   precx[3] ~ dgamma(460, 1000)
   for (j in 1:3) {
      for (i in 1:M) {
         prectau[i,j] ~ dgamma(.3, .3)               # within-school true-score precision
         sigmatau[i,j] <- 1/prectau[i,j]
         tau.schl[i,j] ~ dnorm(mu[j], precmu[j])     # school true scores around the grand mean
      }
      sigmae[j] <- 1/precx[j]
      mu[j] ~ dnorm(0, 1.0E-6)
      precmu[j] ~ dgamma(.001, .001)
      sigmamu[j] <- 1/precmu[j]
   }
}
list(N=5838, M=86)
#insert data here
schl[]  x[,1]  x[,2]  x[,3]
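The shrinkage target changes accordingly. As a heuristic reading that ignores posterior uncertainty in the school-level quantities, a student's estimated true sub-score is now pulled toward his or her school's mean rather than the grand mean,

\hat{\tau}^{\,stud}_{ij} \approx \rho_{sj}\, x_{ij} + (1-\rho_{sj})\,\tau^{\,schl}_{sj}, \qquad \rho_{sj} = \frac{\sigma_{\tau,sj}^2}{\sigma_{\tau,sj}^2 + \sigma_{e j}^2}, \qquad s = schl[i],

with a separate within-school true-score variance for each of the M schools.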

## The hierarchical Kelley's model incorporating previous information
model {
   for (i in 1:N) {
      for (j in 1:3) {
         x[i,j] ~ dnorm(tau.stud[i,j], precx[j])
         tau.stud[i,j] ~ dnorm(tau.schl[schl[i],j], prectau[schl[i],j])
      }
   }
   precx[1] ~ dgamma(6410, 1000)
   precx[2] ~ dgamma(750, 1000)
   precx[3] ~ dgamma(460, 1000)
   for (j in 1:3) {
      for (i in 1:M) {
         # gamma hyperparameters matched to the previous year's posterior mean and variance
         a.prectau[i,j] <- mean.prectau[i,j]*mean.prectau[i,j]/var.prectau[i,j]
         b.prectau[i,j] <- mean.prectau[i,j]/var.prectau[i,j]
         prectau[i,j] ~ dgamma(a.prectau[i,j], b.prectau[i,j])
         sigmatau[i,j] <- 1/prectau[i,j]
         tau.schl[i,j] ~ dnorm(mu[i,j], precmu[j])   # centered on previous-year school means mu[i,j]
      }
      sigmae[j] <- 1/precx[j]
      a.precmu[j] <- mean.precmu[j]*mean.precmu[j]/var.precmu[j]
      b.precmu[j] <- mean.precmu[j]/var.precmu[j]
      precmu[j] ~ dgamma(a.precmu[j], b.precmu[j])
      sigmamu[j] <- 1/precmu[j]
   }
}

list(N=5838, M=86, also provide mean.prectau, var.prectau, mu, mean.precmu, var.precmu based on previous data)
#insert data here
schl[]  x[,1]  x[,2]  x[,3]
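The a.* and b.* assignments in the model above are simple moment matching: a Gamma(a, b) precision has mean a/b and variance a/b^2, so setting

a = \frac{m^2}{v}, \qquad b = \frac{m}{v}

yields a gamma prior whose mean m and variance v equal the previous year's posterior mean and variance for that precision.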

## Shin's model
model {
   for (i in 1:N) {
      x[i,1] ~ dnorm(tau[i,1], precx[1])
      x[i,2] ~ dnorm(tau[i,2], precx[2])
      x[i,3] ~ dnorm(tau[i,3], precx[3])
      # true sub-scores are multivariate normal, so each borrows strength from the others
      tau[i,1:3] ~ dmnorm(mu[], prectau[,])
   }
   precx[1] ~ dgamma(6410, 1000)
   precx[2] ~ dgamma(750, 1000)
   precx[3] ~ dgamma(460, 1000)
   prectau[1:3,1:3] ~ dwish(R[,], 5)
   mu[1:3] ~ dmnorm(mu0[], precmu[,])
   sigmae[1] <- 1/precx[1]
   sigmae[2] <- 1/precx[2]
   sigmae[3] <- 1/precx[3]
   sigmatau[1:3,1:3] <- inverse(prectau[,])
}
list(N=5838, mu0 = c(0,0,0),
   R = structure(.data = c(0.84, 0.67, 0.5,
                           0.67, 0.87, 0.55,
                           0.5,  0.55, 0.59), .Dim = c(3,3)),
   precmu = structure(.data = c(1.0E-6, 0, 0,
                                0, 1.0E-6, 0,
                                0, 0, 1.0E-6), .Dim = c(3,3)))
#insert data here
x[,1]  x[,2]  x[,3]
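Assuming the standard WinBUGS parameterization of dwish(R, k), the prior mean of prectau is kR^{-1}, and sigmatau = prectau^{-1} therefore has an inverse-Wishart prior with

E[\Sigma_\tau] = \frac{R}{k - p - 1} = \frac{R}{5 - 3 - 1} = R

for p = 3 sub-scores and k = 5 degrees of freedom. On this reading, R serves as a prior guess at the true-score covariance matrix, held with close to the smallest degrees of freedom that give a finite prior mean.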

## The hierarchical Shin's model
model {
   for (i in 1:N) {
      x[i,1] ~ dnorm(tau.stud[i,1], precx[1])
      x[i,2] ~ dnorm(tau.stud[i,2], precx[2])
      x[i,3] ~ dnorm(tau.stud[i,3], precx[3])
      tau.stud[i,1:3] ~ dmnorm(tau.schl[schl[i],], prectau[schl[i],,])
   }
   precx[1] ~ dgamma(6410, 1000)
   precx[2] ~ dgamma(750, 1000)
   precx[3] ~ dgamma(460, 1000)
   for (j in 1:M) {
      prectau[j,1:3,1:3] ~ dwish(R1[,], 5)        # within-school true-score precision matrix
      tau.schl[j,1:3] ~ dmnorm(mu[], precmu[,])   # school profiles around the grand mean vector
      sigmatau[j,1:3,1:3] <- inverse(prectau[j,,])
   }
   mu[1:3] ~ dmnorm(mu0[], precmu0[,])
   precmu[1:3,1:3] ~ dwish(R2[,], 5)              # between-school precision matrix
   sigmamu[1:3,1:3] <- inverse(precmu[,])
   sigmae[1] <- 1/precx[1]
   sigmae[2] <- 1/precx[2]
   sigmae[3] <- 1/precx[3]
}
list(N=5838, M=86, mu0 = c(0,0,0),
   R1 = structure(.data = c(0.84, 0.67, 0.5,
                            0.67, 0.87, 0.55,
                            0.5,  0.55, 0.59), .Dim = c(3,3)),
   R2 = structure(.data = c(0.4,  0.07, 0.0,
                            0.07, 0.7,  0.05,
                            0.0,  0.05, 0.19), .Dim = c(3,3)),
   precmu0 = structure(.data = c(1.0E-6, 0, 0,
                                 0, 1.0E-6, 0,
                                 0, 0, 1.0E-6), .Dim = c(3,3)))
#insert data here
schl[]  x[,1]  x[,2]  x[,3]
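Written compactly, this model is a three-level multivariate hierarchy,

x_i \mid \tau^{stud}_i \sim N_3\!\big(\tau^{stud}_i, \Sigma_e\big), \qquad \tau^{stud}_i \sim N_3\!\big(\tau^{schl}_{s(i)}, \Sigma_{\tau, s(i)}\big), \qquad \tau^{schl}_s \sim N_3\!\big(\mu, \Sigma_\mu\big),

with Sigma_e diagonal (measurement errors independent across sub-scores) and a separate true-score covariance matrix Sigma_tau,s estimated for each school.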

## The hierarchical Shin's model incorporating previous year information
model {
   for (i in 1:N) {
      x[i,1] ~ dnorm(tau.stud[i,1], precx[1])
      x[i,2] ~ dnorm(tau.stud[i,2], precx[2])
      x[i,3] ~ dnorm(tau.stud[i,3], precx[3])
      tau.stud[i,1:3] ~ dmnorm(tau.schl[schl[i],], prectau[schl[i],,])
   }
   precx[1] ~ dgamma(6410, 1000)
   precx[2] ~ dgamma(750, 1000)
   precx[3] ~ dgamma(460, 1000)
   for (j in 1:M) {
      # school-specific Wishart priors built from previous-year estimates
      prectau[j,1:3,1:3] ~ dwish(R1[j,,], df[j])
      tau.schl[j,1:3] ~ dmnorm(mu[j,], precmu[,])   # centered on previous-year school means
      sigmatau[j,1:3,1:3] <- inverse(prectau[j,,])
   }
   precmu[1:3,1:3] ~ dwish(R2[,], 47)
   sigmamu[1:3,1:3] <- inverse(precmu[,])
   sigmae[1] <- 1/precx[1]
   sigmae[2] <- 1/precx[2]
   sigmae[3] <- 1/precx[3]
}
list(N=5838, M=86, also provide mu, R1, R2, df based on previous data)
#insert data here
schl[]  x[,1]  x[,2]  x[,3]
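Assuming the same dwish parameterization as above, the per-school degrees of freedom df[j] set how strongly last year's information binds: the prior mean of the within-school precision matrix is

E[\Omega_j] = \nu_j R_{1j}^{-1}, \qquad \nu_j = df[j],

and the Wishart's relative dispersion shrinks as nu_j grows, so a school with more previous-year data can be given a larger df[j] and hence a prior concentrated more tightly around last year's estimate (this presumes R1[j,,] is scaled so that the prior mean matches that estimate).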

APPENDIX B. CONVERGENCE DIAGNOSTICS FOR MODELS A, B, D, AND E

Figure B1. Example history plots for Model A (tau[1,1]).
Figure B2. Example autocorrelation plots for Model A (mu[1], sigmae[1], sigmatau[1], tau[1,1]).
Figure B3. Example Monte Carlo error for Model A.
Figure B4. Example BGR plots for Model A (mu[1], sigmae[1], sigmatau[1], tau[1,1]).
Figure B5. Example history plots for Model B (mu[1], sigmae[1], sigmamu[1], sigmatau[1,1], tau.schl[1,1], tau.stud[1,1]).
Figure B6. Example autocorrelation plots for Model B (mu[1], sigmae[1], sigmamu[1], sigmatau[1,1], tau.schl[1,1], tau.stud[1,1]).
Figure B7. Example Monte Carlo error for Model B.
Figure B8. Example BGR plots for Model B (mu[1], sigmae[1], sigmamu[1], sigmatau[1,1], tau.schl[1,1], tau.stud[1,1]).
Figure B9. Example history plots for Model D (mu[1], sigmae[1], sigmatau[1,1], sigmatau[1,2], tau[1,1]).
Figure B10. Example autocorrelation plots for Model D (mu[1], sigmae[1], sigmatau[1,1], sigmatau[1,2], tau[1,1]).
Figure B11. Example Monte Carlo error for Model D.
Figure B12. Example BGR plots for Model D (mu[1], sigmae[1], sigmatau[1,1], sigmatau[1,2], tau[1,1]).
Figure B13. Example history plots for Model E (mu[1], sigmae[1], sigmamu[1,1], sigmamu[1,2], sigmatau[1,1,1], sigmatau[1,1,2], tau.schl[1,1], tau.stud[1,1]).
Figure B14. Example autocorrelation plots for Model E (mu[1], sigmae[1], sigmamu[1,1], sigmamu[1,2], sigmatau[1,1,1], sigmatau[1,1,2], tau.schl[1,1], tau.stud[1,1]).
Figure B15. Example Monte Carlo error for Model E.
