Selection of Linking Items


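Selection of Linking Items
Select the subset of items that maximally reflects the scale information function.
Solved with a linear programming solver (in R, lp_solve 5.5): minimize $y$ subject to constraints evaluated on the grid $\theta_s \in \{-4, -3.95, \ldots, 3.95, 4\}$, with 0/1 item-selection variables.
[The symbol for the scale information and the constraint expressions are not preserved in the transcript.]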
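
One way to read the slide above is as a minimax problem: pick the linking set whose summed information stays as close as possible, over the θ grid, to the information of the full scale. The sketch below only illustrates that reading, under assumptions that are not in the slides: items follow a 2PL model, the target curve is the scale information rescaled by the fraction of items kept, and a brute-force search stands in for the lp_solve formulation. The item parameters and the names `info_2pl` and `pick_linking_items` are hypothetical.

```python
import itertools
import math


def info_2pl(a, b, theta):
    """Fisher information of a 2PL item (assumed model) at theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def pick_linking_items(items, m, grid):
    """Brute-force minimax search: choose m items whose summed information
    deviates least, in the worst case over the grid, from the rescaled
    full-scale information."""
    full = [sum(info_2pl(a, b, t) for a, b in items) for t in grid]
    scale = m / len(items)  # assumed rescaling of the target curve
    best_set, best_y = None, float("inf")
    for subset in itertools.combinations(range(len(items)), m):
        y = max(
            abs(sum(info_2pl(*items[j], t) for j in subset) - scale * f)
            for t, f in zip(grid, full)
        )
        if y < best_y:
            best_set, best_y = subset, y
    return best_set, best_y


# theta grid from -4 to 4 in steps of 0.05, as on the slide
grid = [-4 + 0.05 * s for s in range(161)]
# hypothetical (a, b) parameters for a tiny four-item scale
items = [(1.0, -1.0), (1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
linking_set, worst_gap = pick_linking_items(items, 2, grid)
```

With these made-up items the search keeps one item from each difficulty level, which reproduces the shape of the full information curve. A real item bank would need an integer programming solver rather than exhaustive search.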

An example: Subscale 2
[Figure: sum of information functions for 6-, 7-, and 8-item linking sets]

An example: Subscale 3

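Why is Fisher information useful?
In multidimensional CAT, the volume of the confidence ellipsoid around $\hat{\theta}$ grows with the determinant of the covariance matrix $I(\hat{\theta})^{-1}$ (it is proportional to its square root; Anderson, 1984), so a larger determinant of the Fisher information matrix means a smaller ellipsoid.
D-optimal method: select the item that maximizes the determinant of the Fisher information matrix (Segall, 1996; Wang & Chang, 2011).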
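
The D-optimal idea can be sketched in a few lines. This is a minimal illustration, not the presenters' implementation: the information matrices are made-up numbers, item information is assumed to add to the accumulated test information, and `det`, `add`, and `d_optimal_pick` are hypothetical helper names written with the standard library only.

```python
def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    d = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d  # a row swap flips the sign
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d


def add(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def d_optimal_pick(current_info, candidate_infos):
    """Return the candidate whose information matrix, added to the current
    test information, gives the largest determinant."""
    return max(
        candidate_infos,
        key=lambda j: det(add(current_info, candidate_infos[j])),
    )


# hypothetical accumulated information and two candidate items
current = [[15.0, 0.0], [0.0, 10.0]]
candidates = {
    "item_A": [[1.0, 0.0], [0.0, 0.0]],  # informs dimension 1 only
    "item_B": [[0.0, 0.0], [0.0, 1.0]],  # informs dimension 2 only
}
best = d_optimal_pick(current, candidates)
```

Here `item_B` wins (determinant 165 versus 160) because it adds information to the dimension that currently has less of it.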

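Fisher information vs. confidence ellipse (Wang et al., 2013)
$I(\theta) = \begin{pmatrix} 15 & 0 \\ 0 & 10 \end{pmatrix}$, $\Sigma = I(\theta)^{-1} = \begin{pmatrix} 0.067 & 0 \\ 0 & 0.1 \end{pmatrix}$

Fisher information vs. confidence ellipse (Wang et al., 2013)
$I(\theta) = \begin{pmatrix} 50 & 0 \\ 0 & 25 \end{pmatrix}$, $\Sigma = I(\theta)^{-1} = \begin{pmatrix} 0.02 & 0 \\ 0 & 0.04 \end{pmatrix}$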
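
The two cases above can be checked numerically. The sketch assumes diagonal information matrices, as on the slides, so the inverse is simply the element-wise reciprocal; the function names are illustrative.

```python
import math


def covariance_from_diagonal_info(info_diag):
    """Variances implied by a diagonal Fisher information matrix."""
    return [1.0 / v for v in info_diag]


def ellipse_size_factor(info_diag):
    """sqrt(det(Sigma)) = 1 / sqrt(det(I)); the ellipse area is
    proportional to this quantity."""
    return 1.0 / math.sqrt(math.prod(info_diag))


first = [15.0, 10.0]   # diagonal of I(theta) in the first case
second = [50.0, 25.0]  # diagonal of I(theta) in the second case

var_first = covariance_from_diagonal_info(first)
var_second = covariance_from_diagonal_info(second)
# how many times smaller the second ellipse is than the first
shrink = ellipse_size_factor(first) / ellipse_size_factor(second)
```

The determinant rises from 150 to 1250, so the confidence ellipse shrinks by a factor of about 2.9 in area.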

Minimax mechanism
Assuming there are three dimensions: [determinant expressions not preserved in the transcript].
This criterion tends to pick the items that reduce the variance of the estimator that is lagging furthest behind.

Item bank information

Domain/content balancing: constraint-weighted D-optimal (Wang et al., 2017)
Suppose the maximum and minimum numbers of items for each domain $k = 1, \ldots, d$ are set in advance.
The weights depend on the number of items from domain $k$ administered so far, the current test length $n$, the maximum test length, and an indicator of whether item $j$ belongs to domain $k$ (Cheng et al., 2009).
[Weight formulas not preserved in the transcript.]

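A simulation study
Sample size: N = 2,000.
True θ: multivariate normal with a mean vector of 0s and covariance matrix Σ [values not preserved in the transcript].
Estimation: maximum a posteriori (MAP), with a multivariate normal prior with a mean vector of 0s [prior covariance not preserved in the transcript].
Evaluation criterion: root mean squared error (RMSE), shown here for the first domain:
$\mathrm{RMSE}(\hat{\theta}_1) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{\theta}_{i1} - \theta_{i1})^2}$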
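
The evaluation criterion is straightforward to compute once estimates and true values are available. The sketch below assumes both are stored as lists of per-person vectors; the function name and the toy data are illustrative only.

```python
import math


def rmse_per_domain(theta_hat, theta_true):
    """RMSE of the estimates for each domain, averaging over persons."""
    n = len(theta_true)
    d = len(theta_true[0])
    return [
        math.sqrt(
            sum((theta_hat[i][k] - theta_true[i][k]) ** 2 for i in range(n)) / n
        )
        for k in range(d)
    ]


# toy example: two persons, two domains (made-up numbers)
theta_true = [[0.0, 0.0], [1.0, 1.0]]
theta_hat = [[0.3, 0.0], [1.4, 1.0]]
rmse = rmse_per_domain(theta_hat, theta_true)
```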

Results: domain-level recovery
[Figure: D-optimal vs. random selection]
[Figures, two slides: D-optimal vs. constraint-weighted D-optimal]

Reducing test length
[Figure: θ confidence interval vs. test length, for θ = (0, 0, 0)]
[Figure: θ confidence interval vs. test length, for θ = (2, 2, 2)]

Variable-length CAT: stopping rule
Start: 300+ items.
Stop when the measurement precision criterion is satisfied (Dodd, Koch, & De Ayala, 1993; Boyd, Dodd, & Choi, 2010).
Candidate precision criteria (Wang et al., 2013):
(a) Volume of the confidence ellipsoid (D rule)
(b) Sum of the standard errors across the domains of θ
(c) Maximum axis of the confidence ellipsoid
(d) Kullback-Leibler divergence between two consecutive posteriors
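
The four criteria can be made concrete with a small sketch. It assumes, purely for simplicity, that the posterior is approximated by a normal distribution with a diagonal covariance matrix, so the ellipsoid axes line up with the domains; the slides do not state this, and the function names are hypothetical.

```python
import math


def precision_criteria(variances):
    """Criteria (a)-(c) for a diagonal covariance matrix."""
    ses = [math.sqrt(v) for v in variances]
    return {
        # (a) the ellipsoid volume is proportional to sqrt(det(Sigma))
        "volume_factor": math.sqrt(math.prod(variances)),
        # (b) sum of the standard errors over domains
        "sum_se": sum(ses),
        # (c) the longest semi-axis is proportional to the largest SE
        "max_axis": max(ses),
    }


def kl_diag_normal(mean_p, var_p, mean_q, var_q):
    """(d) KL divergence KL(p || q) between two normals with diagonal
    covariance, e.g. two consecutive posterior approximations."""
    return 0.5 * sum(
        vp / vq + (mq - mp) ** 2 / vq - 1.0 + math.log(vq / vp)
        for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q)
    )


crit = precision_criteria([0.04, 0.04, 0.04])
same = kl_diag_normal([0.0], [0.04], [0.0], [0.04])
shifted = kl_diag_normal([0.0], [0.04], [0.1], [0.04])
```

Each criterion would then be compared against a preset threshold; the thresholds themselves are not given on the slide.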

Cumulative information growth
[Figure: determinant of the Fisher information matrix vs. test length]


Stopping rule
Start: 300+ items.
Stop when $\hat{\theta}$ no longer changes much from one item to the next: the theta convergence rule (T rule), with a change threshold of 0.01 (Babcock & Weiss, 2012; Wang et al., 2017+).
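
A minimal sketch of the T rule check follows. The slide gives only the 0.01 threshold; taking the largest absolute change across dimensions is an assumption made here for illustration, and `t_rule_satisfied` is a hypothetical name.

```python
def t_rule_satisfied(theta_prev, theta_curr, tol=0.01):
    """True when no dimension of the estimate moved by tol or more
    between two consecutive items (assumed form of the T rule)."""
    largest_change = max(abs(c - p) for p, c in zip(theta_prev, theta_curr))
    return largest_change < tol


still_moving = t_rule_satisfied([0.50, -0.20, 1.10], [0.52, -0.20, 1.10])
converged = t_rule_satisfied([0.50, -0.20, 1.10], [0.505, -0.198, 1.10])
```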

Why is the T rule secondary?
Under the 2PL model, the change in $\hat{\theta}$ after one more item lies in a bounded interval (Chang & Ying, 2008). [Interval expression not preserved in the transcript.]
The change does not decrease monotonically as test length increases, so the T rule can terminate a test prematurely (Wang et al., 2017+).
The T rule can also undermine test efficiency: usually SE($\hat{\theta}$) < .2 is required (Dodd et al., 1993), which corresponds to test information of 25, whereas if, hypothetically, the discrimination is 1, satisfying a change below .01 corresponds to information of 50 (Wang et al., 2017+).

MGRM: simple structure (Wang et al., 2017+)
[Equations not preserved in the transcript.]

MGRM: complex structure (Wang et al., 2017+)
If item j measures the pth trait: [expression not preserved in the transcript; it involves the pth element of a vector and the amount of information carried by item j].
If item j measures multiple traits: [expression not preserved in the transcript].

Primary vs. secondary stopping rules (Babcock & Weiss, 2012; Wang et al., 2017+)
Start: 300+ items; minimum and maximum test lengths are set.
Is the D rule satisfied?
  Yes: stop (94.9%; test length 28.5).
  No: is the T rule satisfied?
    Yes: stop (5.1%; test length 61.5).
    No: continue.
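
The flowchart above can be sketched as a single decision function. The ordering of the minimum-length and maximum-length checks around the two rules is an assumption; the slide only names them. The function name and return labels are illustrative.

```python
def stopping_decision(n_items, d_rule_met, t_rule_met, min_length, max_length):
    """Decide whether to stop after n_items, with the D rule as the
    primary criterion and the T rule as the secondary one."""
    if n_items < min_length:
        return "continue"          # never stop before the minimum length
    if d_rule_met:
        return "stop: D rule"      # primary rule: precision reached
    if t_rule_met:
        return "stop: T rule"      # secondary rule: estimate has converged
    if n_items >= max_length:
        return "stop: maximum length"
    return "continue"


early = stopping_decision(5, True, True, min_length=10, max_length=120)
primary = stopping_decision(28, True, False, min_length=10, max_length=120)
secondary = stopping_decision(60, False, True, min_length=10, max_length=120)
capped = stopping_decision(120, False, False, min_length=10, max_length=120)
```

Checking the D rule first means the T rule only ends tests that have not yet reached the target precision.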

Stopping rule results
[Figure: SE vs. θ for the Applied Cognition, Daily Activity, and Mobility domains]

[Figure: 3D plot]

Stopping rule (cont.)
Test length: mean 28.5, SD 13.3
Overall precision: bias 0.005, RMSE 0.303, determinant 514.7
Primary stop: actual 0.949, eventual 0.965 (1.6%)

By subgroup, at Stop and at End:
          Test length (Stop)   Test length (End)   Bias             RMSE
          Mean     SD          Mean     SD         Stop     End     Stop     End
  N=31    58.7     15.3        72.2     15.5       0.162    0.136   0.430    0.391
  N=71    64.5     13.0        120      0          0.207    0.204   0.592    0.525

Outline
Brief introduction to computerized adaptive testing (CAT)
Multidimensional CAT
Computerized Adaptive Testing to Direct Delivery of Hospital-Based Rehabilitation (NIH R01HD079439, 2015-2020)
  Item bank calibration
  Item selection
  Stopping rules
Ongoing projects

Project I: Classification
Table 2.
  AM-PAC color-coded stage             Level   FIM score   FIM stage
  Independent (Green)                  High    7           Independent
                                       Low     6           Modified independent
  Supervision/Contact Guard (Yellow)   High    5           Supervision
                                       Low     4           Contact guard
  Assistance (Orange)                  High    2-3         Min/Mod Assist
                                       Low     1           Max Assist
  Dependent (Red)                      Red     0           Dependent

Project I: Classification
Multidimensional CAT + post hoc classification, or multidimensional classification CAT?

Project II: Incorporating response time (Fan, Wang, et al., 2012; Wang et al., 2013a, 2013b; Wang & Xu, 2015)
Exploratory data analysis (each batch analyzed separately first):
[Figure: histogram of batch 1 response times over all person-item combinations (SD = 21.28, skew = 41.84); the red line marks the 97.5th percentile (25.85)]
[Figure: after cutting the upper 2.5% of the data (SD = 4.27, skew = 1.23)]
[Figure: after log transformation]
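
The two preprocessing steps shown in the figures, trimming the upper 2.5% and taking logs, could look like the sketch below. It uses a nearest-rank percentile and made-up response times; the slides do not say which percentile definition or software was used, and `upper_trim_and_log` is a hypothetical name.

```python
import math


def upper_trim_and_log(times, upper_q=0.975):
    """Drop response times above the upper_q percentile (nearest-rank
    definition, an assumption), then log-transform what remains."""
    ordered = sorted(times)
    rank = math.ceil(upper_q * len(ordered))
    cutoff = ordered[rank - 1]
    kept = [t for t in times if t <= cutoff]
    return cutoff, [math.log(t) for t in kept]


# made-up data: 99 ordinary response times and one extreme outlier
times = [1.0] * 99 + [500.0]
cutoff, log_times = upper_trim_and_log(times)
```

Trimming before the log transform removes the few extreme times that dominate the skewness of the raw distribution.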

Project II: Incorporating response time (Fan, Wang, et al., 2012; Wang et al., 2013a, 2013b; Wang & Xu, 2015)
A hierarchical response time model (van der Linden, 2007)
[Diagram: population-level parameters (μ, σ) at the top; below them, person parameters θ and τ and item parameters, including φ and λ]

Four different models (EM algorithm)
(1) Following Molenaar et al. (2015), van der Linden's (2007) joint model can be reparameterized as an MGRM, with a correlation between the latent variables [expressions not preserved in the transcript].
(2) Including interviewers as covariates, with interviewer effects that differ across items.
(3) Including interviewers as covariates, with interviewer effects that differ across items by the same proportion.
(4) Including interviewers as fixed covariates.

[Figure panels: Model 1, Model 2, Model 3, Model 4]

Model comparison and results
  Batch   Equation   # of free parameters   AIC      BIC
  1       1          736                    133566   136755
  1       2          1281                   133174   138725
  1       3          741                    133316   136527
  1       4          741                    133409   136620
  2       1          652                    102468   105202
  2       2          940                    102049   105992
  2       3          655                    102235   104982
  2       4          655                    102339   105086
  3       1          656                    111384   114149
  3       2          1040                   110613   114996
  3       3          660                    111001   113783
  3       4          660                    111323   114105
  4       1          648                    108550   111290
  4       2          1028                   107733   112080
  4       3          652                    108174   110931
  4       4          652                    108364   111121

Model 3 results (batch 1)
Correlations among the three θ dimensions: 0.613 (θ1, θ2), 0.466 (θ1, θ3), 0.853 (θ2, θ3).
Estimates of [symbol not preserved in the transcript] are 0.591, 0.691, and 0.596.
Compared to the MGRM alone, adding response time results in higher item discrimination parameter estimates and smaller standard errors.
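
The comparison in the table can be summarized by picking, for each batch, the model with the smallest AIC and the smallest BIC. The sketch below uses the values from the table; the dictionary layout and the helper name `best_model` are illustrative choices.

```python
# fit[batch][model] = (free parameters, AIC, BIC), copied from the table
fit = {
    1: {1: (736, 133566, 136755), 2: (1281, 133174, 138725),
        3: (741, 133316, 136527), 4: (741, 133409, 136620)},
    2: {1: (652, 102468, 105202), 2: (940, 102049, 105992),
        3: (655, 102235, 104982), 4: (655, 102339, 105086)},
    3: {1: (656, 111384, 114149), 2: (1040, 110613, 114996),
        3: (660, 111001, 113783), 4: (660, 111323, 114105)},
    4: {1: (648, 108550, 111290), 2: (1028, 107733, 112080),
        3: (652, 108174, 110931), 4: (652, 108364, 111121)},
}


def best_model(models, index):
    """Model number with the smallest criterion value (index 1 = AIC,
    index 2 = BIC); smaller is better for both criteria."""
    return min(models, key=lambda m: models[m][index])


best_by_aic = {batch: best_model(models, 1) for batch, models in fit.items()}
best_by_bic = {batch: best_model(models, 2) for batch, models in fit.items()}
```

On these values AIC favors Model 2 in every batch, while BIC, which penalizes the extra parameters more heavily, favors Model 3 in every batch.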

Concurrent calibration across 4 batches
Adding response time information did not significantly affect the item parameter estimates or their standard errors.
Adding response time information helped reduce the standard errors of patients' multidimensional latent trait estimates, but adding interviewer as a covariate did not result in further improvement.

Next steps II: Incorporating response time (Fan, Wang, et al., 2012; Wang et al., 2013a, 2013b; Wang & Xu, 2015)
A hierarchical response time model (van der Linden, 2007)
Maximize item information per time unit. [Objective expression not preserved in the transcript.]
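
Selecting by information per time unit can be sketched as a ratio criterion. The version below assumes each candidate item comes with a scalar information value and an expected response time in seconds; how these would be obtained from the multidimensional and hierarchical models is not shown on the slide, and all names and numbers are hypothetical.

```python
def pick_item_per_time(candidates):
    """Return the item with the largest information-to-time ratio.

    candidates maps an item name to (information, expected_seconds).
    """
    return max(
        candidates,
        key=lambda name: candidates[name][0] / candidates[name][1],
    )


# made-up candidates: (information at the current estimate, expected time)
candidates = {
    "item_A": (0.90, 30.0),  # most informative, but slow
    "item_B": (0.60, 12.0),
    "item_C": (0.40, 10.0),
}
chosen = pick_item_per_time(candidates)
```

`item_B` is chosen even though `item_A` carries more information, because it delivers more information per second.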

Next steps III: DIF CAT (Wang, Weiss, & Wang, 2017)
Three factors to consider:
Gender (male/female)
Education (college and above / high school and below)
Age (<65 / 65-90)

Example DIF items
Gender: "How much difficulty do you currently have making decisions, such as what clothes you want to wear?" (Applied Cognition), consistent with the expert hypothesis.
Age: "How much difficulty do you currently have removing a plastic lid from a hot beverage cup?" (Daily Activity)
Age: "How much difficulty do you currently have climbing stairs step over step without a handrail?" (Mobility)

How to deal with DIF in a CAT design?
Items with extreme DIF: delete?
Items with small DIF: keep?
Doubly adaptive CAT: use subgroup information to improve measurement precision (Wang et al., 2017).
  Allow DIF items to have different parameters per subgroup.
  Constraint-weighted D-optimal.

Project IV: Adaptive measure of change (Wang & Weiss, 2017; Wang, 2014)
Specifying the MCAT to efficiently detect meaningful clinical change.

Study I
[Figure: θ at Time 1 and Time 2]

Item selection: select the item that best differentiates the null hypothesis (no individual change) from the alternative hypothesis, i.e., maximize $\mathrm{KL}_j(\hat{\theta}_{i2}, \hat{\theta}^{\text{pooled}}_{i(L+k-1)})$.
Stopping rule: sequential hypothesis testing?

Algorithms
Web-based delivery
Data collection with MCAT
Monitor item usage, and routinely recalibrate item parameters if needed (Chen & Wang, 2016)

My collaborators and team
Dr. David Weiss, University of Minnesota
Dr. Andrea Cheville, Mayo Clinic
Research assistants: Zhuoran Shang, Shiyang Su