Chapter 2. Test Development Procedures: Psychometric Study and Analytical Plan


Introduction

The Peabody Treatment Progress Battery (PTPB) is designed to provide clinically relevant measures of key mental health outcomes and clinical processes. The measures, especially with their repeated use, offer clinicians systematic feedback on their clients, both individually and in relation to other clients served. Such feedback provides rich clinical material for treatment planning, particularly for clients whose outcomes are not improving as expected.

A useful test battery has the following qualities:

- Reliable: Every question in a battery must contribute accurate information.
- Valid: Scores must have evidence-based interpretations.
- Short: Measures must be feasible to administer in the time available. A brief battery enables clinicians and clients to spend their time more effectively.
- Theory-based: A battery must have an understandable theoretical core so that clinicians, caregivers, and youths can understand the results.
- Integrated: A battery must cover the main issues in a cohesive, integrated way, something a collection of unrelated instruments cannot do.

We think the PTPB meets these criteria. At this time, we have strong evidence of the reliability of the instruments. We have some evidence of construct validity, including estimates of factorial and discriminant validity for all instruments and estimates of concurrent (i.e., convergent) validity for the Symptoms and Functioning Severity Scale (SFSS) in the form of correlations with widely used instruments, namely the Child Behavior Checklist (CBCL; Achenbach, 1991) and Youth Self Report (YSR; Achenbach, 1991), the Strengths and Difficulties Questionnaire (SDQ; Goodman, 1999), the Youth Outcome Questionnaire (Y-OQ; Wells, Burlingame, & Lambert, 1999), and the Child and Adolescent Functional Assessment Scale (CAFAS; Hodges & Wong, 1996). We think that the instruments in the PTPB have good reliability and evidence of validity, but it will take more research to further evaluate their construct validity (Cronbach & Meehl, 1955) and predictive validity for monitoring the process and outcome of counseling.

A multi-method approach was used to assure that the PTPB met the criteria described above, including expert review, cognitive testing, a psychometric study, and a rigorous analysis plan.

Expert Reviews

Measures were distributed to a large number of clinicians, researchers, and other experts in the field, who provided feedback on the usefulness and validity (face, content) of the measures and of the items within each measure. This resulted in some revisions (e.g., items reworded, added, or dropped).

Cognitive Testing

The instruments were then administered to convenience samples of youths and their adult caregivers. Trained staff conducted separate cognitive interviews with youths and caregivers to help assure the validity of the measures and ensure age-appropriate readability.

Psychometric Study

We conducted a study from May to September 2005 to obtain descriptive information about the youths served, to test the psychometric properties of the measures in the PTPB, and to get feedback from clients, adult caregivers, and clinicians on the perceived utility of the measures. The sample for this study was made available to us by a national social service agency (our collaborating partner in CFIT) that delivers home- and community-based services for youths and their families. In 2006, the agency served over 43,000 clients with 600 contracts in 32 states and the District of Columbia. Over 600 youths in 28 offices (called regions by the service organization), mostly located in the Eastern or Midwestern U.S., participated in the study as described below.

Rigorous Analysis Plan

Analyses of the PTPB assessed reliability, validity, and brevity to produce an integrated set of measures that can be administered efficiently and at low cost. Each measure was tested with classical test theory, exploratory and confirmatory factor analysis, and Rasch modeling. All measures were reduced to the minimum length consistent with reliability. In addition, the temporal stability and concurrent validity of the essential mental health measure (SFSS) were tested. A description of the general analytic procedures is given below. The history of development, theory base, and statistical results for each instrument are presented in chapters 4 to 13. The examination of the measurement quality of all instruments in the PTPB is continuing, with future updates planned for this manual.

Psychometric Study

Eligibility Criteria

Youths 11 to 18 years old (inclusive) whose clinicians thought they could understand questions such as those in the PTPB were eligible to participate in the study. One primary adult caregiver was also asked to participate if one was present at the time the forms were completed. Clients had to have had at least one therapeutic session prior to study participation to help assure adequate familiarity among the youth, adult caregiver, and clinician when answering questions (e.g., about the youth's mental health status). If a youth had more than one clinician, the one considered the primary clinician and who saw the youth during the data collection period completed the clinician forms.

Confidentiality

All data were obtained as part of our collaborating partner's continuous quality improvement (CQI) initiative, Contextualized Feedback Intervention and Training (CFIT; see Chapter 14 for more information on CFIT). Vanderbilt had no contact with participants, either for recruitment or for data collection. Names or other information that could readily identify respondents were not sought or obtained. All data received and maintained by Vanderbilt included only a unique, non-sensitive ID number for each participant.

Administration Protocols

Sometimes clinicians need to assist youths or caregivers whose reading is impaired. For the study, clinicians were provided their own copy of all youth and caregiver questionnaires that they could read to youths and caregivers as needed. Clinicians were to read the instructions first, then the question, then the possible answer choices, giving respondents enough time to mark their answers on their own forms. Clinicians could repeat answer choices if needed. Youths and caregivers were asked to answer items as best they could, as they understood them. Clinicians were instructed to help with questions, if requested, but not to help with answers. We would recommend protocols such as these to PTPB users.

In order to help assure widespread use of the battery, all instruments were written for respondent understanding at the 4th-grade reading level. The literacy level was tested with Microsoft Office Word and then with the online SMOG (Simplified Measure of Gobbledygook) Calculator. Standard forward-backward translation methods were used to create Spanish versions. A bilingual, native Spanish speaker translated the original English version into Spanish. A second bilingual, non-native speaker back-translated the Spanish version into English, and this back-translation was compared with the original English version to assure accuracy of translation.
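For readers who want to check readability in a similar way, the standard SMOG formula estimates a reading grade from the count of polysyllabic (three-or-more-syllable) words in a sample of about 30 sentences. The sketch below is a minimal Python illustration of that published formula; it is not the online calculator used for the PTPB, and the example counts are hypothetical.

```python
import math

def smog_grade(polysyllable_count: int, sentence_count: int = 30) -> float:
    """Standard SMOG reading-grade estimate.

    polysyllable_count: words of three or more syllables in the sampled text.
    sentence_count: number of sentences sampled (30 in the classic procedure).
    """
    return 3.1291 + 1.0430 * math.sqrt(polysyllable_count * (30.0 / sentence_count))

# Hypothetical example: 8 polysyllabic words in a 30-sentence sample -> roughly grade 6.
print(round(smog_grade(8), 1))  # 6.1
```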

General Procedures

The youth, a primary adult caregiver (if present), and the clinician completed the measures at the end of a session to reduce any undue influence that completing the forms might have on the therapeutic process. In order to minimize burden while testing, the battery was divided into two booklets of different (not repeated) measures and administered at two consecutive sessions (on average, one week apart). Certain measures, such as the Therapeutic Alliance Quality Scale (TAQS), were purposefully placed in the 2nd booklet to help assure that even relatively new clients (and their adult caregivers) would have at least one or two sessions with their clinician before answering questions about the therapeutic relationship. The SFSS and the external measure used to test its validity were placed in the 1st booklet. Clinicians were allowed to read questions to youths and adult caregivers, but were instructed not to help with answers. All youth and adult caregiver measures were available in English and Spanish. Offices were asked to administer both booklets to all eligible clients within four weeks and then ship their completed materials to Vanderbilt. This time frame was optimistic; data were received from the 1st region about seven weeks after the study's start date.

A multi-step process began once data were received. First, the number of booklets received was logged by region, respondent type (youth, adult caregiver, or clinician), and administration (1st or 2nd). Second, a detailed protocol was used to check data quality, including cases in which respondents recorded two answers for the same item or produced highly suspect response patterns that would suggest invalid data. A coin toss was used to decide which of two answers for the same item to code, but only when the responses represented adjacent categories and were not contradictory (a brief sketch of this rule appears after Table 2.1). Heads always indicated coding the higher value; tails, the lower. For example, if strongly agree (code 5) and agree (code 4) were both marked and the coin toss showed heads, strongly agree (code 5) would be coded for data entry. If agree and disagree were both endorsed, the item was coded as missing. Otherwise, data that remained ambiguous were considered missing. Unusual response patterns, for example answering all questions identically, were reviewed independently by two raters. There were remarkably few instances of aberrant response patterns, and the raters nearly always agreed when one presented itself. The project's data manager made the final determination in the event of inter-rater disagreement. Cases with any such response pattern were flagged, by measure, so that they could be excluded as needed during analysis.

Additional safeguards were used for data entry. After initial cleaning, data were entered twice into a Microsoft Office Access database. Special programming alerted data entry staff to discrepancies between the two entries as well as to entry of out-of-range values for each variable. The database was translated into SAS system files. Univariate statistics (e.g., frequencies, means) for each variable were generated and examined for accuracy.

As Table 2.1 shows, 610 youths, 486 caregivers, and 603 clinicians completed the 1st booklet. Somewhat fewer completed the 2nd booklet, with completion rates of 85%, 84%, and 88%, respectively.

Table 2.1 Booklet Completion Status

Respondent          1st Booklet (#)   2nd Booklet (#)   2nd Booklet Completion (%)
Youths                    610               519                  85%
Adult Caregivers          486               408                  84%
Clinicians                603               532                  88%
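As referenced above, the double-mark rule can be written out as a small function. This is only a sketch of the stated rule (the actual adjudication during data cleaning was done by hand with a physical coin); the function name and the simulated coin toss are ours.

```python
import random
from typing import Optional

def adjudicate_double_mark(a: int, b: int, rng: random.Random) -> Optional[int]:
    """Resolve an item where a respondent marked two answer choices.

    Adjacent categories (e.g., codes 5 and 4): a simulated coin toss picks the value,
    with "heads" coding the higher value and "tails" the lower. Non-adjacent
    (contradictory) pairs are coded as missing (None).
    """
    if abs(a - b) != 1:
        return None                      # e.g., agree (4) and disagree (2) -> missing
    heads = rng.random() < 0.5
    return max(a, b) if heads else min(a, b)

rng = random.Random(2005)
print(adjudicate_double_mark(5, 4, rng))  # 5 or 4, depending on the toss
print(adjudicate_double_mark(4, 2, rng))  # None (coded as missing)
```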

Sample Characteristics

Of the 610 youths who completed the 1st booklet, 42% were female. Over a third (36%) were African American, 56% were white, and 8% were multi-ethnic or of other ethnic backgrounds. Eighteen percent were Hispanic. On average, youths were 15 years old, with 15% between 11 and 12 years of age, 45% between 13 and 15, and 40% between 16 and 18. Based on caregiver reports, over two-thirds (67%) of youths had been diagnosed with a mental health disorder at some time before or during current treatment. Almost half (48%) of the youths had used alcohol or drugs.

On average, caregivers were 45 years old. Nearly 90% were female. About 40% were African American, 57% were white, and 3% were multi-ethnic or of other ethnic backgrounds. Eleven percent were Hispanic. About 20% of caregivers had not received a high school diploma, two-thirds had completed high school, and 15% had earned a bachelor's degree or higher. Almost half (45%) had annual household incomes of less than $21,000; 37% had incomes of $31,000 or more. About a third of caregivers were divorced or separated, about half were currently married or living as married, 14% had never married, and the others were widowed. Two-thirds reported ever being diagnosed with an emotional, behavioral, or substance use problem, but only 20% reported ever receiving professional help. Nearly all (95%) caregivers considered themselves to be the youth's primary caregiver. About half were the youth's birth parent, another 33% were foster parents, and the others were grandparents or other family members. Three-fourths reported they knew the youth at least fairly well, with the remainder (virtually all foster parents) reporting they did not know the youth very well.

Analytical Plan

The psychometric analyses emphasized evaluating every item of every instrument for its reliability and validity.

Determining Reliability

Reliability of each instrument was determined using several standard methods, in addition to methods specifically designed for developing brief instruments for frequent use. These methods are described below. Note that established psychometrics for all types of reliability described below are not available for all instruments (see chapters 4-13 for a discussion of the tests conducted for each measure).

Reliability Coefficients

Cronbach's alpha internal consistency reliability coefficients were calculated for total scores and any subscale scores, where applicable. This statistic is larger when internal consistency is high and smaller when it is low. A Cronbach's alpha of 0.80 or higher is generally considered satisfactory (Clark & Watson, 1995), indicating that a measure is of sufficient length and that the items appear to be measuring similar content.

Comprehensive Item Psychometrics

We examined each item using currently available models for psychometrics, namely classical test theory (CTT), confirmatory factor analysis (CFA), and Rasch modeling. We view all three methods as valid tools, each with strengths and limitations, for creating brief instruments for frequent use. Each of the models produces information to identify stronger and weaker items in a given test. By putting this information into a single table, we can evaluate a test and its items at a glance. Information about the statistical merits of each item is necessary to determine whether a test should be revised. (A brief computational sketch of the classical item statistics and coefficient alpha is given below.)
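To make the classical item screening concrete, the following sketch computes the item mean, kurtosis, corrected item-total correlation, and coefficient alpha for a block of Likert items. It is a minimal Python illustration using a hypothetical respondent-by-item data frame named `items`; the study's data were handled in SAS and the IRT estimates came from WINSTEPS, so this is not the authors' code.

```python
import pandas as pd

def item_summary(items: pd.DataFrame) -> pd.DataFrame:
    """CTT-style screening statistics for a block of Likert items.

    Returns, per item: mean, (excess) kurtosis, and corrected item-total
    correlation (correlation of the item with the sum of all other items).
    """
    total = items.sum(axis=1)
    return pd.DataFrame({
        "mean": items.mean(),
        "kurtosis": items.kurtosis(),
        "item_total_r": [items[c].corr(total - items[c]) for c in items.columns],
    })

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total)."""
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var / total_var)
```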

Note that throughout this manual we use the terms Rasch modeling and Item Response Theory (IRT) interchangeably to refer to the logistic model-based approach to test development, as compared with CTT. Table 2.2 shows the statistical information examined for a test or questionnaire with Likert-scale items, and the criteria used to check each item.

Table 2.2 Statistical Properties of Effective Test Items

Criterion                Sought Values                     Rationale
Mean                     Between 2 and 4 (5-point scale)   Avoid floors and ceilings in the target sample
Kurtosis                 Not extremely high                Avoid items where everyone gives the same response
Rasch Measure Score      Cover the range                   Need items across the range of youths
Item-Total Correlation   Higher is better                  Keep items that measure a single thing
Infit & Outfit           Between 0.5 and 1.5               Keep items that fit the 1PL (logistic) model
Discrimination           Avoid low discrimination          Avoid items that cannot discriminate

Note: IRT estimates obtained with WINSTEPS 3.63.0 (Linacre, 2007)

Below is a brief explanation of the rationale for each criterion:

Mean: If an item's mean is at the top or bottom of its range (e.g., a mean of 4.8 or 1.2 on a 1-5 scale), nearly everyone is giving the same rating, making it impossible for that item's meager variance to correlate with anything.

Kurtosis: A good way to spot items with poor measurement ability is to focus on those that are excessively leptokurtic, meaning that a great number of people all have the same score. An example is the item "I have attempted suicide" in a nonclinical sample. Since nearly all respondents would say no, the kurtosis is very high. Traditionally, psychometricians would say that items with extreme means and kurtosis have floor or ceiling artifacts and make little psychometric contribution to a test.

Rasch Measure Score: An item's difficulty or rarity, as expressed in the measure score of the Rasch logistic model, shows where the item is efficient and informative about a given test taker. For example, an item that is easy to endorse, such as "I have worried more than once," might tell us nothing about differences between serious cases of psychopathology. The most efficient strategy for accurate measurement is to have a range of items from very easy to very difficult or unusual (e.g., "I have committed homicide") so that the entire range of clients is measured reliably. Please note that two different types of scaling were used in this manual. Most chapters scaled the Rasch measure scores to have a mean of 50 and a standard deviation of 10. Three chapters (8, 9, and 10) used a scaling with a mean of 0 and a standard deviation of 1.

Item-Total Correlation: In classical test theory (Lord & Novick, 1968), test designers realized that if a group of items is a reliable measure of a single construct, such as psychopathology or intelligence, the items must be correlated. By observing the correlation between each item and the sum of all the others, we can identify the odd, unrelated items by their low correlations. It is senseless to add unrelated items into an index score. By dropping items with low item-total correlations, we create an index with high internal consistency.

Rasch Model Infit and Outfit: Since the Rasch model defines good measurement, items that fit the model are good items, and scores on the good items show a consistent S-shaped logistic relationship to the person's strength on the measured trait. The Infit mean square measures model fit for the middle cases in the distribution; Outfit, for the extreme cases at the tails of the distribution. According to Linacre (2006), items with an Infit and Outfit between 0.5 and 1.5 contribute to the reliability of measurement, and items outside that range do not.

Rasch Model Discrimination: While the Rasch model is a 1-parameter logistic model, WINSTEPS 3.63.0 (Linacre, 2007) estimates each item's discrimination after the 1-parameter logistic model is estimated. Items with low discrimination are less effective at detecting which people are high or low on the measured trait.

We dropped items from some instruments based on all of these indices of item quality. Please understand that while these criteria provide guidance, there are no agreed-upon cut scores for most of them. Fortunately, in our testing, the eliminated items had warning flags on multiple indices.

Standard Errors of Measurement

The standard error of measurement (SEM) shows how much uncertainty there is around each youth's score on a given occasion. This statistic is smaller (i.e., the measurement is more precise) when reliability is high, and larger when reliability is low or the test's standard deviation is high. Assuming, for example, that the total score on the adult caregiver SFSS (full version) was 60, a standard error of 2.62 would mean that the true severity level for the youth is between 54.76 and 65.24 with approximately 95% certainty. Thus, the smaller the standard error, the more precise the measurement.

Reliable Change Index

The reliable change index (RCI) provides guidance about when a client's change in scores is so large that we can be fairly certain that the client either improved or worsened significantly. For example, if a youth's self-report score on the SFSS decreased by more than 5.36 points (the reliable change threshold reported in the corresponding chapter), we would be 75% confident that the change in scores represents a significant improvement in the client's severity level.
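The two statistics above follow from standard formulas. As a hedged illustration (the chapter does not spell out the exact computation used for the PTPB thresholds), the sketch below computes the SEM as the test standard deviation times the square root of one minus reliability, and a Jacobson-Truax style reliable change threshold based on the standard error of a difference score. The z values are assumptions chosen to match the approximate confidence levels mentioned in the text, not the authors' published parameters.

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

def reliable_change_threshold(sd: float, reliability: float, z: float = 1.15) -> float:
    """Jacobson-Truax style reliable change threshold.

    The standard error of a difference score is SEM * sqrt(2); the threshold is
    z times that value (z ~= 1.15 gives roughly 75% two-sided confidence,
    z ~= 1.96 gives roughly 95%).
    """
    return z * sem(sd, reliability) * math.sqrt(2.0)

# The manual's SEM illustration: a score of 60 with SEM = 2.62 gives an
# approximate 95% band of 60 +/- 2 * 2.62 = (54.76, 65.24).
print(60 - 2 * 2.62, 60 + 2 * 2.62)
```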

This does not mean that a change smaller than the reliable change threshold reported for each scale is not meaningful. However, because respondents to questionnaires do not always use the available answer options consistently (e.g., one time a youth uses "sometimes" and another time "often" to describe the frequency with which he got into trouble, even though nothing has really changed), we cannot be very certain that a score change smaller than the reliable change threshold reflects a real change in severity level. If the change in scores is less than the reliable change threshold, the clinician may want to determine in some other way whether the change represents real change. Reviewing which items contributed the most to the change in scores may provide some insight. It is also helpful to look at a trend with more than just two data points. A trend is more reliable than a simple change score, and provides information on whether a score is stable, improving, or declining over time.

Test-Retest Reliability

Test-retest reliability is a good indicator of an instrument's reliability over time. This form of reliability is available only for the SFSS at this time. Future updates of the PTPB will include additional analyses for the other instruments. The procedures for the SFSS test-retest study were as follows. Four regional offices were targeted for a test-retest reliability study of the SFSS. These regions received only the two-week SFSS in their 1st and 2nd booklets, along with a few items on demographic background and the number of sessions between administrations (see the note below Table 2.3). Clinicians were instructed to administer the two booklets 4-7 days apart at the end of each of two sessions. If the two sessions were more than 4-7 days apart, they were asked to administer the 1st booklet and then the 2nd at the very next session. Participants for whom there was less than one day or more than 14 days between administrations were not included in this assessment of test-retest reliability. The average interval between the two administrations for the final sample was 7 days (SD = 3.8).

Table 2.3 shows the number of clients in each region about whom youths, adult caregivers, and clinicians reported during the first (test) and second (retest) administrations of the SFSS.

Table 2.3 SFSS Reliability Test: Participation by Region

                              # of      Youth Reports         Adult Caregiver Reports   Clinician Reports
                              Regions   1st Adm.   2nd Adm.   1st Adm.   2nd Adm.       1st Adm.   2nd Adm.
Total for Reliability Test       4        105         84         85         54            108         88

Note: Additionally, the order of the SFSS items was reversed between two of the regions in order to test for order effects. Roughly the first half of the items appeared at the beginning of the measure for one region (both administrations) and at the end of the measure for the other region (both administrations). There was no significant order effect in this sample.
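As a small illustration of the computation, the test-retest coefficient for cases within the eligible 1-14 day window is simply a Pearson correlation. The column names in the sketch below (`score_t1`, `score_t2`, `days_between`) are hypothetical and not taken from the study's data files.

```python
import pandas as pd

def test_retest_r(df: pd.DataFrame) -> float:
    """Pearson correlation of scores across the two administrations,
    restricted (as in the study) to cases 1-14 days apart."""
    eligible = df[(df["days_between"] >= 1) & (df["days_between"] <= 14)]
    return eligible["score_t1"].corr(eligible["score_t2"])
```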

Determining Validity

Factorial Validity

One of the chief purposes of the study was to assess the factorial validity of the PTPB measures. Factorial validity examines whether the correlations among test items fit the theory of what the test purports to measure. For some measures, where the factorial structure was more complex and not as clearly determined by theory, we randomly split the psychometric sample into two groups: a test sample and a validity sample. Analyses were conducted with the test sample until a clear factor structure was determined and the model fit the test data in a satisfactory way. This factorial structure was then validated using the validity sample.

Scree Plots

A scree plot of eigenvalues is used in exploratory factor analysis for a visual inspection of whether the data are best represented by more than one factor. The plot shows the eigenvalue of each principal component. Most often, scree plots were used to see whether a test fits a one-factor model; if the second eigenvalue is less than two or three, a robust second factor might not be possible. (A minimal computational sketch of these eigenvalues is given after Table 2.4.) Some instruments, such as the Caregiver Strain Questionnaire (CGSQ), are multifactorial, meaning they have subtests. The multiple factors were developed by inspection of scree plots, followed by confirmatory factor analysis on a split sample. We would like to note that scree plots are tools of exploratory factor analysis. Whereas exploratory factor analysis (EFA) plays an important role in identifying the underlying dimensions of an instrument when there are no a priori theoretical expectations, confirmatory factor analysis (CFA) is most useful for evaluating whether a theoretically meaningful model fits the observed data (Floyd & Widaman, 1995).

Confirmatory Factor Analysis

Confirmatory factor analysis (CFA) estimates how well a measurement model fits the data. For example, when several subscales are suggested by theory, they can be tested to see how well the theory fits the data. CFA lets us determine the factorial validity of each scale. While this form of validity is less important than criterion validity, it is necessary for the interpretation of scores. To estimate model fit, we used the three popular fit statistics named in Table 2.4: Bentler's Comparative Fit Index (CFI; Bentler, 1990), Joreskog's Goodness of Fit Index (GFI; Joreskog, 1988), and the Root Mean Square Error of Approximation (RMSEA; Steiger, 2000). A one-factor model was required to meet these rather exacting cutoffs (Yu, 2002). According to Browne and Cudeck (1993), values greater than 0.90 indicate good fit between a model and the data for the CFI and GFI; for the RMSEA, a value of 0.05 indicates close fit, 0.08 fair fit, and 0.10 marginal fit.

Table 2.4 Fit Criteria

Criterion         Requirement
Bentler's CFI     0.90 or more
Joreskog's GFI    0.90 or more
RMSEA             Satisfactory: 0.05 or less; Fair: 0.051 to 0.08
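As the sketch promised above, the eigenvalues behind a scree plot can be computed directly from the inter-item correlation matrix and plotted in descending order. This uses a principal components decomposition of the correlation matrix for illustration only; it is not necessarily the exact extraction method used in the PTPB analyses, and the `items` data frame is hypothetical.

```python
import numpy as np
import pandas as pd

def scree_eigenvalues(items: pd.DataFrame) -> np.ndarray:
    """Eigenvalues of the inter-item correlation matrix, sorted for a scree plot.

    A dominant first eigenvalue with a small second eigenvalue (e.g., < 2)
    is consistent with an essentially one-factor instrument.
    """
    corr = items.corr().to_numpy()
    eigenvalues = np.linalg.eigvalsh(corr)   # symmetric matrix -> real eigenvalues, ascending
    return eigenvalues[::-1]                 # descending order for the scree plot

# Plotting the returned array against its index (e.g., with matplotlib)
# produces the scree plot itself.
```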

Convergent and Discriminant Validity

Because these forms of validity are particularly important for a battery of instruments, an entire chapter is devoted to a discussion of the procedures and results (see Chapter 3).

General Analytical Issues

Age, Gender, Race and Ethnicity

For each instrument, multi-factorial ANOVAs, followed by multiple comparisons, were conducted to determine whether any differences were apparent by youth age, gender, and racial or ethnic group. Significance levels were adjusted for the multiple tests using the Bonferroni adjustment. Only two comparisons were statistically significant. For the clinician report of the SFSS-33, white youth had somewhat higher scores (mean = 63.7, SD = 8.9) than minority youth (mean = 60.8, SD = 7.9), F = 10.96, p < 0.001. For the adult caregiver version of the TOES, a significant age by race interaction was found, reflecting slightly lower outcome expectations among caregivers of minority youth between the ages of 14 and 15 years than among caregivers of white youth. These differences were so small in magnitude, in the overall context of non-significant comparisons, that it was decided not to present different norms for these groups at this time. It is also important to note that these analyses were based on a sample of youth who received services for their mental health problems. It may well be that there are more pronounced differences among these groups (gender, race, ethnicity, and age) in nonclinical population samples. Future analyses with additional samples are needed to determine whether different norms for these groups should be established.

Missing Data

Analyses for each measure were based on cases with 85% or more completed data on the measure. Among these cases, subject mean substitution was used for any missing data. Cases with less than 85% completed data were excluded entirely. Please note that the self-scoring forms provided with this battery can be completed directly for a subject when all data are present. When data are missing, users will need to compute the subject's mean across the completed items in that measure first and substitute it for any missing item. Determining the cut point (e.g., 85%) at which too much data are missing to compute summary scores in this way is at the discretion of the user.
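The scoring rule just described translates directly into code. The sketch below illustrates the 85% completion screen and subject mean substitution with a hypothetical respondent-by-item data frame; it is an illustration of the rule, not the self-scoring forms or software distributed with the battery.

```python
import pandas as pd

def score_with_mean_substitution(items: pd.DataFrame, min_complete: float = 0.85) -> pd.Series:
    """Total scores using subject mean substitution for missing items.

    Respondents answering fewer than `min_complete` of the items get no score (NaN);
    otherwise each missing item is replaced by that respondent's mean across the
    items he or she did answer, and the items are summed.
    """
    n_items = items.shape[1]
    frac_complete = items.notna().sum(axis=1) / n_items
    filled = items.apply(lambda row: row.fillna(row.mean()), axis=1)
    totals = filled.sum(axis=1)
    return totals.where(frac_complete >= min_complete)
```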

Instrument Scoring and Interpretation

In each chapter, specific scoring procedures are presented for the total score and any subscale scores (if applicable), along with examples of the self-scoring forms and information on the clinical interpretation of scores. All PTPB measures and self-scoring forms can also be found in Appendix B: Measures and Self-Scoring Forms. In addition to the reliable change index described previously, total and subscale scores for an individual respondent can be compared for that same respondent over time, as a trend, and in relation to the psychometric sample. Comparisons to the psychometric sample are given in the form of low, medium, and high ranges, which help to judge whether a score should be considered relatively low, medium, or high compared with the current psychometric sample. For example, if a youth's SFSS Total Score falls in the low range, it means that this youth is rated as showing relatively low levels of symptoms and functioning problems relative to the youths in the psychometric sample (not a normal school population). Percentile rank tables that represent the distribution of scores in the psychometric sample are presented as well. Additional samples for comparison will be available in the future.
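For users who want to reproduce the comparison by hand, one common definition of a percentile rank is the percentage of the reference (psychometric) sample scoring at or below a given score. The sketch below uses that definition; the published percentile rank tables may follow a slightly different convention.

```python
import numpy as np

def percentile_rank(score: float, reference_scores) -> float:
    """Percentage of the reference sample scoring at or below `score`."""
    reference_scores = np.asarray(reference_scores, dtype=float)
    return 100.0 * np.mean(reference_scores <= score)
```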

References

Achenbach, T. M. (1991). Integrative guide for the 1991 CBCL/4-18, YSR, and TRF profiles. Burlington, VT: University of Vermont, Department of Psychiatry.

Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107(2), 238-246.

Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 136-162). Newbury Park, CA: Sage.

Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7(3), 309-319.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.

Floyd, F. J., & Widaman, K. F. (1995). Factor analysis in the development and refinement of clinical assessment instruments. Psychological Assessment, 7, 286-299.

Goodman, R. (1999). The extended version of the Strengths and Difficulties Questionnaire as a guide to child psychiatric caseness and consequent burden. Journal of Child Psychology and Psychiatry and Allied Disciplines, 40(5), 791-799.

Hodges, K., & Wong, M. M. (1996). Psychometric characteristics of a multidimensional measure to assess impairment: The Child and Adolescent Functional Assessment Scale. Journal of Child and Family Studies, 5(4), 445-467.

Joreskog, K. G. (1988). Analysis of covariance structures. In J. R. Nesselroade & R. B. Cattell (Eds.), Handbook of multivariate experimental psychology (2nd ed., pp. 207-230). New York, NY: Plenum Press.

Linacre, J. M. (2006). Misfit diagnosis: Infit outfit mean-square standardized. Retrieved June 1, 2006, from http://www.winsteps.com/winman/diagnosingmisfit.htm

Linacre, J. M. (2007). WINSTEPS 3.63.0 [Computer software]. Retrieved January 8, 2007, from http://www.winsteps.com/index.htm

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

SMOG Calculator [Online computer program]. Retrieved December 4, 2006, from http://www.wordscount.info/hw/smog.jsp

Steiger, J. H. (2000). Point estimation, hypothesis testing, and interval estimation using the RMSEA: Some comments and a reply to Hayduk and Glaser. Structural Equation Modeling: A Multidisciplinary Journal, 7(2), 149-162.

Wells, M. G., Burlingame, G. M., & Lambert, M. J. (1999). Youth Outcome Questionnaire. In M. E. Maruish (Ed.), The use of psychological testing for treatment planning and outcome assessment (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.

Yu, C.-Y. (2002). Evaluating cutoff criteria of model fit indices for latent variable models with binary and continuous outcomes (Doctoral dissertation, University of California, Los Angeles). (UMI ProQuest ATT 3066425)