Using Structural Equation Modeling to Test for Differential Reliability and Validity: An Empirical Demonstration

Structural Equation Modeling: A Multidisciplinary Journal. ISSN: 1070-5511 (Print), 1532-8007 (Online). Journal homepage: http://www.tandfonline.com/loi/hsem20

To cite this article: Ruth Raines-Eudy (2000) Using Structural Equation Modeling to Test for Differential Reliability and Validity: An Empirical Demonstration, Structural Equation Modeling: A Multidisciplinary Journal, 7:1, 124-141, DOI: 10.1207/S15328007SEM0701_07

To link to this article: https://doi.org/10.1207/s15328007sem0701_07

Published online: 19 Nov 2009.

STRUCTURAL EQUATION MODELING, 7(1), 124-141
Copyright © 2000, Lawrence Erlbaum Associates, Inc.

TEACHER'S CORNER

Using Structural Equation Modeling to Test for Differential Reliability and Validity: An Empirical Demonstration

Ruth Raines-Eudy
Tulane University School of Social Work

Requests for reprints should be sent to Ruth Raines-Eudy, Department of Health Services Administration, University of Arkansas at Little Rock, 2801 S. University Avenue, Little Rock, AR 72204. E-mail: rleudy@ualr.edu

Structural equation modeling (SEM) techniques provide us with excellent tools for conducting preliminary evaluation of differential validity and reliability of measurement instruments among a comprehensive selection of population groups. This article demonstrates empirically an SEM technique for group comparison of reliability and validity. Data are from a study of 495 mothers' attitudes toward pregnancy. Proportions of African American and White, married and unmarried, and Medicaid and non-Medicaid mothers provided sample sizes large enough for group comparisons. Four hypotheses are tested: that factor structures are invariant between subgroups, that factor loadings are invariant between subgroups, that measurement error is invariant between subgroups, and that means of the latent variable are invariant between subgroups. Discussion of item distributions, sample size issues, and appropriate estimation techniques is included.

The interrelated dilemmas of differential reliability and validity are inherent in many fields of social research due to the differing ethnic, socioeconomic, and cultural groups that comprise our populations of interest. Whether the goal of our research is theory testing or practical application, as social scientists we wish to develop measurement instruments that are simultaneously valid, reliable, and generalizable to populations as large and inclusive as possible. However, as noted by Blalock (1982),

Whenever measurement comparability is in doubt, so is the issue of the generalizability of the corresponding theory. Although one may state a theory in a sufficiently general form that it may be applied in diverse settings, tests of this theory will require an assessment of measurement comparability. If the theory succeeds in one setting but fails in another, and if measurement comparability is in doubt, one will be in the unfortunate position of not knowing whether the theory needs to be modified, whether the reason for the differences lies in the measurement-conceptualization process, or both. (p. 30)

What Blalock's comment implies, and what is often not explicitly addressed in published reports of research results, is that testing for differential validity and reliability among a comprehensive selection of population groups should be part of the preliminary evaluation of instruments used for social research. Structural equation modeling (SEM) techniques provide us with excellent tools for conducting this preliminary evaluation. This article presents an empirical demonstration of a technique for examining the reliability and validity of measurement instruments or scales both within and between pertinent subpopulations. For clarity, the technique will be illustrated with a one-factor measurement model and three sets of subpopulation comparisons; however, it is equally applicable for use with multifactor models or with structural models.

DEVELOPMENT OF THE MEASUREMENT MODEL

Mueller's (1997) guidelines and Bollen's (1989) recommendations for specification of measurement models and their application to the current example will be reviewed here because they are crucial preliminary steps for estimation of between-group differences in reliability and validity. Mueller (1997) stressed the importance of theory and understanding of the substantive area for meaningful model construction.
Bollen (1989) stated that the first steps in developing a measurement model should include a theoretical definition that guides selection of measures, identification of the latent variable or variables, formation of measures, and specification of their relationship to the latent variable (p. 180).

The measurement model chosen for this demonstration of testing differential reliability and validity originated from a larger, theoretically driven study investigating women's health beliefs related to pregnancy and prenatal care. The theoretical approach to specification of measurement models was the Health Belief Model (HBM) developed by Rosenstock (1974) and his associates (Becker & Maiman, 1983) for use with preventive health measures. The HBM is an applied value-expectancy theory with roots in social cognitive theory and social learning theory (Rosenstock, 1974). It is the most commonly applied theoretical approach in studies of prenatal care use, an area in which most studies are atheoretical. In its most parsimonious form, the HBM contains two main clusters of latent constructs: health threat and expected outcome. These constructs, along with exogenous demographic factors and cues to action such as media campaigns, predict the likelihood of obtaining preventive health care. The perceived susceptibility of individuals to health problems along with the perceived seriousness of those problems if left untreated constitute the health threat (Becker & Maiman, 1983). The individual then undertakes a cost-benefit analysis of the ratio of benefits of care to barriers to care, which comprises the expected outcome (Rosenstock, 1991).

In this study the HBM was applied to women's decisions to seek timely and adequate prenatal care. Latent variables measured included maternal and infant susceptibility and seriousness, two types of benefits (general and specific benefits), and three types of barriers (internal psychological, convenience, and health system barriers). The latent variable chosen for this demonstration of differential reliability and validity was one of the barrier constructs proposed by Bluestein and Rutledge in their 1993 conceptualization of the HBM as it applies to prenatal care seeking. They described the concept as internal psychological barriers to care (1993). This concept, which incorporates the mother's early feelings about her pregnancy, had not been previously tested in studies using the HBM. However, fairly consistent support had emerged in the literature for the importance of individual items related to the concept in determining the level of care obtained (Bluestein & Rutledge, 1992; Fisher et al., 1991; Poland, Ager, & Olson, 1987; Sable, Stockbauer, Schramm, & Land, 1990). These items were therefore selected as measures or indicators representing the latent variable, Feelings about Pregnancy. They were specified as effect indicators of the construct, with the causal direction going from the latent variable to the indicators, rather than the reverse (Bollen & Lennox, 1991).
Figure 1 contains the originally hypothesized measurement model.

RELIABILITY AND VALIDITY

Because this was a new instrument, there were no preexisting measures of reliability. However, preliminary item-scale correlations and Cronbach's alphas provided evidence of acceptable levels of reliability. Overall reliability extracted, calculated using the appropriate formula (Footnote 1), was 0.91, indicating low overall error variance in the model. The reliability extracted measure ranges from 0 to 1, with values over 0.50 considered acceptable.

Footnote 1. \[ \rho_{\xi} = \frac{\left(\sum_{i=1}^{p} \lambda_{x_i}\right)^{2} \mathrm{var}(\xi)}{\left(\sum_{i=1}^{p} \lambda_{x_i}\right)^{2} \mathrm{var}(\xi) + \sum_{i=1}^{p} \mathrm{var}(\delta_i)} \] (Dillon & Goldstein, 1984, p. 480).
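The reliability extracted formula can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the loadings below are hypothetical (the individual estimates are not listed in the text), the solution is assumed to be standardized so that var(ξ) = 1 and each error variance equals 1 minus the squared loading, and the function name `reliability_extracted` is introduced here for convenience.

```python
def reliability_extracted(loadings, error_variances, factor_variance=1.0):
    """Reliability extracted for one latent variable.

    Numerator: (sum of loadings)^2 * var(xi).
    Denominator: numerator + sum of the indicator error variances.
    """
    true_part = sum(loadings) ** 2 * factor_variance
    error_part = sum(error_variances)
    return true_part / (true_part + error_part)


# Hypothetical standardized loadings for four indicators (not the article's estimates).
loadings = [0.80, 0.85, 0.90, 0.90]
# Assumption: standardized solution, so each error variance is 1 - loading^2.
error_variances = [1.0 - lam ** 2 for lam in loadings]

rho = reliability_extracted(loadings, error_variances)
print(f"reliability extracted = {rho:.3f}")
```

Because the numerator squares the sum of the loadings, the measure rises quickly as indicators are added, which is one reason it is usually read alongside the indicator-level reliabilities rather than on its own.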

FIGURE 1 Hypothesized one-factor measurement model for Feelings about Pregnancy.

The survey instrument was pilot tested twice, using samples of 28 and 25 women to establish face validity and appropriateness of question wording for the population. Content validity was established by a thorough review of the literature for items measuring attitudes toward pregnancy. Discriminant validity was established by estimating two-factor measurement models with Feelings about Pregnancy as one of the factors and each of two other barrier latent variables, one for convenience barriers and the other for health systems barriers. According to the literature, these are all discrete constructs that are not expected to be positively correlated, although they all present barriers to care. The models for intercorrelation were not supported, χ²(19, N = 495) = 52.85, p = 0.02 for convenience barriers, and χ²(19, N = 495) = 34.62, p = 0.04 for health systems barriers. This indicates that Feelings about Pregnancy is indeed not correlated with measures of latent variables measuring different constructs within the same model (Bollen, 1989). Convergent validity was not possible to ascertain in the data set because there were no other latent variables that were hypothesized to be highly correlated with Feelings about Pregnancy.

Criterion validity has been difficult to establish in this area of research, as there are no generally accepted criteria for measures of maternal health beliefs (Bates, Fitzgerald, & Wolinsky, 1994). However, two measures were available in the data set that provided support for the criterion validity of the construct. These were the mothers' response to the item "I started prenatal care late with this pregnancy" and the actual number of visits recorded by the physician on the official American College of Obstetricians and Gynecologists form that was available for many of the mothers' charts. The results of a structural model testing the relation between the construct Feelings about Pregnancy and an endogenous latent variable ("Poorcare") comprising these two measures provided strong evidence for criterion validity, χ²(7, N = 460) = 8.78, p = 0.36 (adjusted goodness of fit index [AGFI] = 0.98; comparative fit index [CFI] = 1.00; root mean square error of approximation [RMSEA] = 0.02; RMSR = 0.04). The squared multiple correlation for the structural model was 0.32; the standardized gamma coefficient for the path from Feelings to Poorcare was 0.54 (t = 4.81).

Finally, Dillon and Goldstein's (1984) formula (Footnote 2) for calculating the shared variance of the indicators in a construct provided support for construct validity. The shared variance is called the variance extracted. It varies from 0 to 1, and it represents the proportion of the total variance that is due to the latent variable. According to Dillon and Goldstein (1984) and Bagozzi (1991), a variance extracted of greater than 0.50 indicates that the validity of both the construct and the individual variables is high. The shared variance for the Feelings about Pregnancy latent variable was 0.72.

Footnote 2. \[ \rho_{vc(\xi)} = \frac{\sum_{i=1}^{p} \lambda_{x_i}^{2}\, \mathrm{var}(\xi)}{\sum_{i=1}^{p} \lambda_{x_i}^{2}\, \mathrm{var}(\xi) + \sum_{i=1}^{p} \mathrm{var}(\delta_i)} \] (Dillon & Goldstein, 1984, p. 480).

DIFFERENTIAL RELIABILITY AND VALIDITY

The reliability of a measure is that part containing no purely random error (Carmines & Zeller, 1979). In SEM terms, the reliability of an indicator is defined as the variance in that indicator that is not accounted for by measurement error. It is commonly represented by the squared multiple correlation coefficient, which ranges from 0 to 1 (Bollen, 1989; Jöreskog & Sörbom, 1993a). However, because these coefficients are standardized, they are not useful for comparing reliability across subpopulations.

Differential validity has been defined as differing test scores for differing subgroups of test takers (Cole & Moss, 1989). This can be detected in SEM models by comparison of factor loadings or unstandardized λ coefficients for the same measurement model estimated for different subpopulations, by visually inspecting the coefficients for the subpopulations (Bollen, 1989). The reliability and validity extracted formulas previously presented can be examined for rough estimates of the amount of error variance and degree of validity present in each subgroup. However, in order to draw meaningful comparisons in which statistically significant differences in subgroup factor structure, reliability, and validity are detected, more sophisticated multigroup methods are required.

The example described here demonstrates empirically the SEM technique described by Jöreskog and Sörbom (1989) that can be used to detect whether measures contain significant amounts of either differential reliability or validity, depending on the populations in which they are used. The method systematically assesses several hypotheses comparing the factor structure, reliability, validity, and mean differences in latent variables of a measure as it is applied in different subpopulations. First, the factor pattern of the measurement model is hypothesized to be identical or invariant for each group (H_{ξ=1}). Second, the factor loadings or λ coefficients for the measurement model are hypothesized to be invariant across groups (H_Λ), indicating that there is not differential validity. Third, given that the λs are invariant, both these factor loadings and the error terms are hypothesized to be invariant across groups (H_ΛΘ). If supported, this hypothesis provides evidence that reliabilities do not differ for differing subgroups. For the model presented here, one further hypothesis is tested. The means of latent variables for differing subpopulations are hypothesized to differ significantly. This would indicate that members of one subpopulation on average have a tendency to feel significantly more positively or negatively about their pregnancies.
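A small numerical sketch may help show why standardized coefficients give only rough subgroup comparisons. In the Python below, two hypothetical groups share exactly the same unstandardized loadings and error variances (so the second and third hypotheses hold by construction) and differ only in the variance of the latent variable. All values and function names are assumptions made for illustration, not estimates from the study.

```python
def squared_multiple_correlations(loadings, error_variances, factor_variance):
    """Indicator reliabilities: variance due to the factor over total indicator variance."""
    return [
        (lam ** 2 * factor_variance) / (lam ** 2 * factor_variance + err)
        for lam, err in zip(loadings, error_variances)
    ]


def variance_extracted(loadings, error_variances, factor_variance):
    """Variance extracted: summed explained variance over explained plus error variance."""
    explained = sum(lam ** 2 for lam in loadings) * factor_variance
    return explained / (explained + sum(error_variances))


# Hypothetical unstandardized estimates, identical in both groups.
loadings = [1.0, 0.9, 0.8, 1.1]
error_variances = [0.5, 0.5, 0.5, 0.5]

# The groups differ only in the latent variable's variance.
smc_a = squared_multiple_correlations(loadings, error_variances, factor_variance=1.0)
smc_b = squared_multiple_correlations(loadings, error_variances, factor_variance=2.0)
ve_a = variance_extracted(loadings, error_variances, factor_variance=1.0)
ve_b = variance_extracted(loadings, error_variances, factor_variance=2.0)

for name, smc, ve in (("group A", smc_a, ve_a), ("group B", smc_b, ve_b)):
    print(name, [round(r, 3) for r in smc], round(ve, 3))
```

Group B shows higher squared multiple correlations and a higher variance extracted even though its loadings and error variances are identical to group A's. This is the reason the formal tests constrain unstandardized parameters across groups instead of comparing standardized values.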
DATA AND METHODS

Sample size is a critical concern when working with SEM measurement models, particularly when variables may not be normally distributed or when distributions of variables are not known in advance. For the current study, participants were 495 women who gave birth in the labor and delivery unit of a hospital affiliated with a medical school in a large Midwestern metropolitan area. Preliminary data indicated that its population was more representative of the metropolitan area population than those of other local hospitals. Twenty-four percent of mothers who delivered during the time of the study did not participate. However, there were no statistically significant differences in demographics between women who did or did not participate.

The demographics of the study sample and the large sample size made it ideal for multigroup comparison using SEM. Subgroup comparisons available were African American mothers (58%) versus White mothers (42%); married mothers (47%) versus unmarried mothers (53%); and mothers with Medicaid (43%) versus mothers with private insurance (57%). The full sample was interviewed using a structured survey instrument with items measuring perceptions of pregnancy and prenatal care. In order to rule out bias due to varying levels of literacy, the questionnaire was read to all participants, who then circled their responses to each item.

ESTIMATION

Mueller (1997) cautioned that, if the assumptions of multivariate normality are not met when using distribution-dependent methods such as maximum likelihood (ML), then biased estimates may result. PRELIS 2 provides estimates of the probability that variables are bivariate normal and an overall multivariate normality measure that should be taken into account when determining the proper estimator for a model (Jöreskog & Sörbom, 1993b). In cases where these assumptions are not met, the appropriate estimation technique is weighted least squares (WLS) rather than ML estimation (Jöreskog & Sörbom, 1989). Calculations are based on the polychoric correlation matrix rather than the covariance matrix. Values of the variables must be weighted by the inverse of the asymptotic covariance matrix in order to minimize the sum of squared deviations of the sample from the population (Bollen, 1989). The assumed underlying bivariate normality of the weighted polychoric correlation matrix must be confirmed before proceeding with model estimation (Muthen, 1993).

Prior to conducting the tests for differential reliability and validity, the measurement model was estimated for the entire sample of 495 women using WLS estimation. An 11-point Likert scale was chosen in anticipation of a continuous response pattern. However, visual inspection of the variables revealed a bimodal distribution, with modes of 0 and 10 for all four items. Further analysis with PRELIS 2 revealed that the items did not meet assumptions of bivariate normality required for ML estimation. Accordingly, the items were dichotomized.
This resulted in variables with no more than 60% of the cases in one category, which is less likely to create biased estimates than are more unbalanced dichotomies (Pedhazur & Schmelkin, 1991). Tests in PRELIS indicated that this transformation yielded bivariate and multivariate normality for the polychoric correlations.

Measurement models were then estimated within each of the groups to be compared to ascertain that the models held within each group (Byrne, Shavelson, & Muthen, 1989). The hypothesized measurement model was replicated in each of three sets of randomly selected subsamples: African American and White comparison, married and unmarried comparison, and Medicaid and private insurance comparison. All samples met the sample size requirement of N ≥ 200 for WLS estimation (Boomsma, 1987). All samples also met the assumptions of bivariate and multivariate normality for the polychoric correlations.

Next, the hypotheses of equal factor pattern, equal factor loadings (λ coefficients), and equal error terms (Θ_δs) were tested using Jöreskog and Sörbom's (1989) guidelines. First, pattern structures were constrained to be invariant across two groups, with λs and Θ_δs allowed to vary. Second, both factor patterns and λs were constrained to be invariant, with Θ_δs allowed to vary. Finally, all three were constrained to be invariant across groups. The group comparisons are designed to evaluate construct validity across subgroups. Because the group comparisons are not designed for hypothesis testing, estimation of three separate tests for construct validity across subgroups should not result in compounded probability of Type I error due to group member overlap. The subsamples are randomly drawn from a larger population in which these demographic groups do naturally overlap.

As Table 1 illustrates, the means of the items differed among subsamples in an apparently nonrandom way. Unmarried women, African American women, and women whose primary insurer was Medicaid were more likely to view their pregnancies in a more negative light than were married women, White women, and women with private insurance. In order to determine whether these differences are statistically significant, it is necessary to conduct a test for equal means of the latent variable. However, estimation of mean differences poses a difficult problem when WLS estimation must be used with ordinal or nonnormal variables. With WLS the polychoric correlations are weighted by the inverse of the asymptotic covariance matrix, and the means are standardized to 0 (Jöreskog & Sörbom, 1993a). To test for the significance of the difference in means of latent variables requires ML estimation. The kappa (κ) coefficient is the mean vector of ξ. When κ is constrained to be 0 in the first group and allowed to vary in the second group, the difference in group means is given by the value of κ in the second group (Jöreskog & Sörbom, 1989). It is also possible to perform a test of the χ² difference in a model with κ held invariant in the second group compared to one in which it is allowed to vary in the second group.
If the model with unequal means (κ free) is a better fit to the data than the model with equal means (κ invariant), then the hypothesis of equal means can be rejected. Results of the ML estimation must be interpreted with caution, however, when it is used with nonnormally distributed variables, as is the case here.

TABLE 1
Means of Indicators for Subsamples

Subsample            x1     x2     x3     x4
African American     4.85   4.70   3.80   6.90
White                2.75   3.05   2.78   4.34
Unmarried            5.39   5.48   5.48   7.78
Married              2.35   2.26   2.26   3.60
Medicaid             5.25   5.25   4.02   7.41
Private insurance    2.60   2.73   1.90   4.19
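The χ² difference test for equal latent means described above can be sketched as follows. The fit values in the example are hypothetical placeholders, since the fit of the κ-constrained and κ-free models is not reported in this part of the article, and the helper names `chi2_sf` and `chi2_difference_test` are introduced here for illustration. The tail probability uses the standard closed-form series for integer degrees of freedom, so only the Python standard library is needed.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df (finite series form)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2) * sum_k x^k / (1*3*...*(2k+1))
    term, total = 1.0, 0.0
    for k in range((df - 1) // 2):
        if k > 0:
            term *= x / (2 * k + 1)
        total += term
    return math.erfc(math.sqrt(half)) + math.sqrt(2.0 * x / math.pi) * math.exp(-half) * total


def chi2_difference_test(chi2_restricted, df_restricted, chi2_free, df_free):
    """Compare a restricted model (kappa invariant) with a nested freer model (kappa free)."""
    delta_chi2 = chi2_restricted - chi2_free
    delta_df = df_restricted - df_free
    if delta_df <= 0:
        raise ValueError("The restricted model must have more degrees of freedom.")
    return delta_chi2, delta_df, chi2_sf(delta_chi2, delta_df)


# Hypothetical fit values, for illustration only.
delta_chi2, delta_df, p_value = chi2_difference_test(25.0, 11, 12.0, 10)
print(f"chi-square difference = {delta_chi2:.2f}, df = {delta_df}, p = {p_value:.4f}")
```

A small p value here would favor the κ-free model, that is, rejection of equal latent means, subject to the caution already noted about ML estimation with nonnormal items.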

RESULTS

The estimated model for the full sample of 495 mothers yielded evidence of reliability and construct validity. The chi square for the measurement model using the full sample was 3.58 (p = 0.17), with an AGFI of 0.99, CFI of 1.00, RMSR of 0.02, and RMSEA of 0.04. Standardized λ coefficients (validity coefficients) for the indicators ranged from 0.81 to 0.93, with highly significant t values. Squared multiple correlations (reliabilities) ranged from 0.76 to 0.86. Pearson correlations among items ranged from 0.26 to 0.54. As previously noted, the reliability extracted was 0.91, and the variance extracted (validity) was 0.72.

All hypothesized measurement models for the individual subsamples were supported, with acceptably low chi squares, all other fit statistics well above 0.90, and low levels of overall model error indicated in the RMSEA. Figures 2 through 7 contain the overall model fit statistics and individual model parameters for each of these pairs of subsamples. The parameter estimates for the factor loadings (λs) were all significant and fairly high. There were few differences in λs between subgroups, with two notable exceptions. The λs for indicator X3 differed more than the λs for other indicators for all subgroup comparisons. Additionally, the error terms and squared multiple correlations appeared to differ in several of the subgroup comparisons. These preliminary results indicated that it was necessary to proceed to multigroup comparisons of factor structure, factor loadings, and error terms.

FIGURE 2 Measurement model for African American subsample with t values in parentheses (N = 272).

FIGURE 3 Measurement model for White subsample with t values in parentheses (N = 205).

Tables 2 through 4 contain the results of the tests of the first three hypotheses for each subgroup comparison (H_{ξ=1}, H_Λ, H_ΛΘ). The decision to support each hypothesis, reported at the bottom of the table, was based on several overall measures of fit (χ², goodness of fit index [GFI], AGFI, normed fit index [NFI], CFI, and RMSEA) as recommended by Tanaka (1993), rather than on a single measure. All

FIGURE 4 Measurement model for married subsample with t values in parentheses (N = 233).

TABLE 2 Model Fit Statistics for Comparison of African American and White Subsamples for Hypotheses 1 Through 3

| Model Fit Statistics | Hypothesis 1: Equal Factor Structure (H_{ξ=1}) | Hypothesis 2: Equal Factor Loadings (H_Λ) | Hypothesis 3: Equal Factor Loadings and Error Terms (H_{ΛΘ}) |
| --- | --- | --- | --- |
| Chi square (df, p) | 4.99 (4, 0.29) | 9.79 (7, 0.20) | 10.41 (11, 0.49) |
| Goodness-of-fit index | 1.00 | 0.99 | 0.99 |
| Adjusted goodness-of-fit index | 0.99 | 0.98 | 0.99 |
| Normed fit index | 0.99 | 0.99 | 0.99 |
| Comparative fit index | 1.00 | 1.00 | 1.00 |
| Root mean square error of approximation | 0.02 | 0.03 | 0.00 |
| Decision based on fit | Supported | Supported | Supported |

FIGURE 5 Measurement model for unmarried subsample with t values in parentheses (N = 256).

TABLE 3 Model Fit Statistics for Comparison of Married and Unmarried Subsamples for Hypotheses 1 Through 3

| Model Fit Statistics | Hypothesis 1: Equal Factor Structure (H_{ξ=1}) | Hypothesis 2: Equal Factor Loadings (H_Λ) | Hypothesis 3: Equal Factor Loadings and Error Terms (H_{ΛΘ}) |
| --- | --- | --- | --- |
| Chi square (df, p) | 3.75 (4, 0.44) | 7.26 (7, 0.40) | 8.57 (11, 0.66) |
| Goodness-of-fit index | 1.00 | 1.00 | 0.99 |
| Adjusted goodness-of-fit index | 0.99 | 0.98 | 0.98 |
| Normed fit index | 0.99 | 0.98 | 0.98 |
| Comparative fit index | 1.00 | 1.00 | 1.00 |
| Root mean square error of approximation | 0.00 | 0.01 | 0.01 |
| Decision based on fit | Supported | Supported | Supported |

FIGURE 6 Measurement model for Medicaid subsample with t values in parentheses (N = 208).

TABLE 4 Model Fit Statistics for Comparison of Medicaid and Private Insurance Subsamples for Hypotheses 1 Through 3

| Model Fit Statistics | Hypothesis 1: Equal Factor Structure (H_{ξ=1}) | Hypothesis 2: Equal Factor Loadings (H_Λ) | Hypothesis 3: Equal Factor Loadings and Error Terms (H_{ΛΘ}) |
| --- | --- | --- | --- |
| Chi square (df, p) | 3.00 (4, 0.50) | 6.05 (7, 0.53) | 7.43 (11, 0.66) |
| Goodness-of-fit index | 1.00 | 1.00 | 1.00 |
| Adjusted goodness-of-fit index | 0.99 | 0.99 | 0.99 |
| Normed fit index | 0.99 | 0.99 | 0.99 |
| Comparative fit index | 1.00 | 1.00 | 1.00 |
| Root mean square error of approximation | 0.00 | 0.00 | 0.00 |
| Decision based on fit | Supported | Supported | Supported |

FIGURE 7 Measurement model for private insurance subsample with t values in parentheses (N = 241).

hypotheses were supported for each subgroup comparison. This allows us to conclude with some degree of certainty that the factor structure, reliability, and validity for this measure of feelings about pregnancy did not differ significantly between African American and White mothers, between married and unmarried mothers, or between mothers with private insurance and mothers with Medicaid.

The test for mean differences using ML estimation, however, indicated that the mean values of ξ did differ significantly for all three subgroup comparisons. Table 5 contains the results of the nested model comparisons. The chi-square differences in all comparisons (κ free compared with κ invariant) were significant, indicating that models with unequal means were a better fit than models with equal means. Again, it should be noted that these estimations were performed with ML and should be interpreted tentatively.
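The significance of a chi-square difference between nested models can be checked by hand. The following is a minimal sketch, not the procedure used in the original analysis (which relied on LISREL output): it computes the upper-tail probability of a chi-square variate with a closed-form expression for integer degrees of freedom and applies it to the chi-square differences reported in Table 5, each with 1 degree of freedom. The function name `chi2_sf` and the dictionary labels are illustrative choices.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variate with integer df."""
    if x < 0 or df < 1:
        raise ValueError("x must be non-negative and df a positive integer")
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite correction series
    tail = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for r in range(1, (df - 1) // 2 + 1):
        tail += term
        term *= x / (2 * r + 1)
    return tail


# Chi-square differences (kappa invariant vs. kappa free) as reported in Table 5
table5 = {
    "African American vs. White": 12.43,
    "Unmarried vs. married": 21.84,
    "Medicaid vs. private insurance": 11.52,
}

p_values = {label: chi2_sf(diff, df=1) for label, diff in table5.items()}
for label, p in p_values.items():
    print(f"{label}: chi-square difference p = {p:.5f}")
```

Each of the three p values falls below .001, in line with the footnote to Table 5. The same function can be applied to the difference between any pair of nested models, using the difference in their degrees of freedom as `df`.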

TABLE 5 Tests for Hypothesis 4, Equality of Means for Three Sets of Subsamples (Kappa Invariant vs. Kappa Free)

| Subgroups | χ² | df | κ |
| --- | --- | --- | --- |
| African American and White | 12.43* | 1 | 0.37 |
| Unmarried and married | 21.84* | 1 | 0.29 |
| Medicaid and private insurance | 11.52* | 1 | 0.28 |

*p < .001.

DISCUSSION

This comparison of measurement models for feelings about pregnancy across three pairs of demographic subgroups of mothers demonstrates a method to determine the suitability of measurement instruments for a wide variety of population subgroups. In the example reported here, no significant differences in validity or reliability of the hypothesized measurement model, Feelings about Pregnancy, were found for any of the three subgroup comparisons. Practically, this means that the measurement model is valid and reliable for use within these subpopulations. The findings also make it possible to test the hypothesis that attitudes or feelings about pregnancy may play an intervening role in the relationship between exogenous variables such as poverty, ethnicity, and marital status, and mothers' behavior in beginning and maintaining adequate prenatal care during their pregnancies.

The findings regarding equal means are less conclusive. To estimate this model, it was necessary to use the original, continuous metric with ML estimation. Because the item responses were bimodal, the excessive kurtosis of the continuous variables may have influenced the chi square or other overall fit estimates (Bollen, 1989). However, estimates of λs and δs in the model using ML were consistent with those calculated using WLS. In any event, this illustrates the importance of taking into account the clinical or social significance of findings when considering their statistical significance.
The fact that African American mothers, mothers who were not married, and mothers whose medical bills were paid by Medicaid were all less likely to view their pregnancies favorably in the early stages is an important substantive finding, regardless of its statistical significance. It has implications for further theoretical work using the HBM to predict or explain prenatal care use.

Testing for differential reliability and validity is a practical preliminary test prior to the use of survey instruments for the social sciences, given the growing diversity and multicultural nature of American society. Used in pilot tests for new survey instruments, the method could help to ensure that study findings are not biased due to flaws in the instruments used to measure attitudes, opinions, beliefs, and other variables that may hold different meanings in different segments of our pluralistic society. Used retrospectively, this method would be particularly useful with large data sets such as the General Social Survey, General Accounting Office Surveys, or any other nationally representative surveys. Many extant data sets contain variables with ordinal or dichotomous scaling that do not meet the assumptions of ML. In these cases WLS will be required. This has implications for the determination of sample size for reliability and validity studies in these data sets.

When planning a new study that includes pretests for differential reliability and validity, sample size requirements become even more critical to address. The findings reported here came from a sample in which the demographic categories were fairly evenly split, and the sample size was large enough to facilitate comparisons. Thus, even though the variable distribution dictated the use of WLS, sample sizes of at least 200 were available for the subgroups. This may not always be the case. If the metropolitan population had been more representative of the country as a whole, the sample size required to compare African American and White mothers using WLS would have been approximately 1700. In this case, a carefully planned oversampling of the underrepresented demographic groups might be one way to reduce the overall sample size. Sample size is less problematic when variables are normally distributed and ML estimation can be used. Bollen (1989) recommended at least several cases per free parameter (p. 268), which would reduce the sample size requirement for a one-factor model with four indicators considerably. When planning sample size, it is important for the researcher to consider whether or not all item responses to a given questionnaire will be normally distributed.
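The sample size reasoning above can be made concrete with a small calculation. This is a rough sketch under stated assumptions rather than a formal power analysis: the subgroup share of 0.12 and the multiplier of 5 cases per free parameter are hypothetical values chosen for illustration (the text gives neither), and the target of 200 cases per subgroup simply mirrors the subgroup sizes that were available in this study under WLS. The function names are likewise illustrative.

```python
import math


def total_n_for_subgroup(min_subgroup_n: int, subgroup_share: float) -> int:
    """Total simple random sample expected to yield min_subgroup_n subgroup cases."""
    if not 0 < subgroup_share <= 1:
        raise ValueError("subgroup_share must be in (0, 1]")
    return math.ceil(min_subgroup_n / subgroup_share)


def free_parameters_one_factor(n_indicators: int) -> int:
    """Loadings plus error variances, with the factor variance fixed to 1."""
    return 2 * n_indicators


ASSUMED_SHARE = 0.12       # hypothetical population share of the smaller subgroup
MIN_PER_GROUP = 200        # subgroup size treated as the target under WLS
CASES_PER_PARAMETER = 5    # hypothetical reading of "several cases per free parameter"

# Total sample needed without oversampling, so that the smaller group reaches 200
n_wls_total = total_n_for_subgroup(MIN_PER_GROUP, ASSUMED_SHARE)

# Per-group sample implied by a cases-per-parameter rule for four indicators
n_ml_per_group = CASES_PER_PARAMETER * free_parameters_one_factor(4)

print(f"Total N without oversampling (WLS target): {n_wls_total}")
print(f"Per-group N under the cases-per-parameter rule (ML): {n_ml_per_group}")
```

Under these assumed values the first figure lands close to the approximately 1700 cases mentioned above, which shows how quickly the total grows when one subgroup is a small share of the population. Oversampling that subgroup directly avoids the inflation, because the target is then set per group rather than for the sample as a whole.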
If views are polarized on an issue, it is likely that variable distributions will be bimodal, or at least have high levels of kurtosis. In other cases, the majority of respondents may have a positive response to some items, but a significant number of outlying opinions may skew the distribution, making it unsuitable for ML estimation. A careful review of research findings using the same or similar questions may shed some light on the expected distributions. However, unless strong evidence suggests that all responses will have normal distributions, and that variables within a scale will have bivariate and multivariate normality, the safe approach appears to be to assume that WLS will be required and to plan sample size accordingly.

ACKNOWLEDGMENTS

Work for this article was supported by a grant from the Tulane University Committee on Research. I wish to thank David F. Gillespie and the anonymous reviewers for their suggestions.

REFERENCES

Bagozzi, R. P. (1991). Further thoughts on the validity of measures of elation, gladness, and joy. Journal of Personality and Social Psychology, 61, 98-104.

Bates, A. S., Fitzgerald, J. F., & Wolinsky, F. D. (1994). Reliability and validity of an instrument to measure maternal health beliefs. Medical Care, 32, 832-846.

Becker, M. H., & Maiman, L. A. (1983). Models of health-related behavior. In D. Mechanic (Ed.), Handbook of health, health care, and the health professions (pp. 539-566). New York: Free Press.

Blalock, H. M. (1982). Conceptualization and measurement in the social sciences. Newbury Park, CA: Sage.

Bluestein, D., & Rutledge, C. M. (1992). Determinants of delayed pregnancy among adolescents. The Journal of Family Practice, 35, 406-410.

Bluestein, D., & Rutledge, C. M. (1993). Psychosocial determinants of late prenatal care: The health belief model. Family Medicine, 25, 269-272.

Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.

Bollen, K., & Lennox, R. (1991). Conventional wisdom on measurement: A structural equation perspective. Psychological Bulletin, 110, 305-314.

Boomsma, A. (1987). The robustness of maximum likelihood estimation in structural equation models. In P. Cuttance & R. Ecob (Eds.), Structural equation modeling by example (pp. 160-188). Cambridge, England: Cambridge University Press.

Byrne, B. M., Shavelson, R. J., & Muthen, B. O. (1989). Testing for equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105, 456-466.

Carmines, E. G., & Zeller, R. A. (1979). Reliability and validity assessment. In E. G. Carmines (Ed.), Sage university papers series on quantitative applications in the social sciences (pp. 107-117). Newbury Park, CA: Sage.

Cole, N., & Moss, P. (1989). Bias in test use. In R. L. Linn (Ed.), Educational measurement (3rd ed.). New York: Macmillan.

Dillon, W., & Goldstein, M. (1984). Multivariate analysis: Methods and applications. New York: Wiley.

Fisher, M.-J., Ewigman, B., Campbell, J., Benfer, R., Furbee, L., & Zweig, S. (1991). Cognitive factors influencing women to seek care during pregnancy. Family Medicine, 23, 443-446.

Jöreskog, K. G., & Sörbom, D. (1989). LISREL 7 user's guide. Chicago: Scientific Software International.

Jöreskog, K. G., & Sörbom, D. (1993a). LISREL 8: Structural equation modeling using the SIMPLIS command language. Chicago: Scientific Software International.

Jöreskog, K. G., & Sörbom, D. (1993b). New features in PRELIS2. Chicago: Scientific Software International.

Mueller, R. O. (1997). Structural equation modeling: Back to basics. Structural Equation Modeling, 4, 352-369.

Muthen, B. O. (1993). Goodness of fit with categorical and other nonnormal variables. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 205-234). Newbury Park, CA: Sage.

Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, & analysis: An integrated approach. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.

Poland, M. L., Ager, J. W., & Olson, J. M. (1987). Barriers to receiving adequate prenatal care. American Journal of Obstetrics and Gynecology, 157, 297-303.

Rosenstock, I. M. (1974). Historical origins of the health belief model. Health Education Monographs, 2, 328-335.

Rosenstock, I. M. (1991). The health belief model: Explaining health behavior through expectancies. In K. Glanz, F. M. Lewis, & B. K. Rimer (Eds.), Health behavior and health education: Theory, research, and practice (pp. 39-61). San Francisco: Jossey-Bass.

Sable, M. R., Stockbauer, J. W., Schramm, W. F., & Land, G. H. (1990). Differentiating the barriers to adequate prenatal care in Missouri, 1987-1988. Public Health Reports, 105, 549-555.

Tanaka, J. S. (1993). Multifaceted conceptions of fit in structural equation models. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models. Newbury Park, CA: Sage.