WORKING PAPER 10

Measurement Models for Behavioral Frequencies: A Comparison Between Numerically and Vaguely Quantified Reports

By Jamie Lynn Marincic

September 2012

Abstract

Surveys collecting behavioral frequency information request either numerically or vaguely quantified frequency reports. Regardless of the nature of the behavioral frequency report, researchers often place respondents on a latent behavioral frequency continuum based on responses to a set of related items. To the extent that these different reports capture the same frequency information, measurement models based on either should not differ; however, whereas numerically quantified reports reflect absolute frequency information, the literature has demonstrated that vaguely quantified reports reflect relative or conditional frequency information. Therefore, measurement models based on numerically and vaguely quantified reports may differ in important ways. Using data from an experiment embedded in the 2006 National Survey of Student Engagement, the current study finds that the structure of measurement models based on vaguely and numerically quantified data differs; thus, the measurement of behavioral frequency reports directly influences substantive findings. One potential explanation for differences in model outcomes is the differential interpretation of vague quantifiers. Next, the study addresses whether latent variable models can be used to detect differential interpretation by identifying a continuous latent interpretation factor (via a random intercept item factor model) or categorical latent interpretation classes (via a factor mixture model). Accounting for differential interpretation via a random intercept model did not improve the comparability of measurement models; however, a two-class factor mixture model did improve comparability. These analyses illustrate how researchers can use certain latent variable models to extract a methodological artifact (differential interpretation) from measurement models intended to be purely substantive.

CONTENTS

I. MEASUREMENT MODELS FOR BEHAVIORAL FREQUENCIES
   A. Introduction
   B. Background
   C. Data

II. INFLUENCES OF DIFFERENTIAL INTERPRETATION ON MEASUREMENT MODELS
   A. Analytic Strategy
   B. Results
      1. Numerically Quantified Reports
      2. Vaguely Quantified Reports
         a. Interval Treatment
         b. Ordinal Treatment
   C. Model Comparisons
      1. Scenario One
      2. Scenario Two
   D. Discussion

III. DETECTION OF DIFFERENTIAL INTERPRETATION BY MEASUREMENT MODELS
   A. Analytic Strategy
   B. Results
      1. Random Intercept Factor Model
      2. Factor Mixture Model
   C. Discussion

IV. CONCLUSION

REFERENCES

Figure I.1 Example Experimental Question (from Nelson Laird et al. 2008)

TABLES

I.1   Experimental NSSE Item Descriptions
II.1  Pearson Partial Correlation Matrix of Natural Log of Numerically Quantified Yearly Behavioral Frequencies (N = 8,174)
II.2  Confirmatory Factor Model Fit Indices: Numerically Quantified Reports
II.3  Pearson Correlation Matrix of Vaguely Quantified Yearly Behavioral Frequencies (N = 8,174)
II.4  Confirmatory Factor Model Fit Indices: Vaguely Quantified Reports (continuous)
II.5  Reference Model Fit Indices: Vaguely Quantified Reports (continuous)
II.6  Polychoric Correlation Matrix of Vaguely Quantified Yearly Behavioral Frequencies (N = 8,174)
II.7  Confirmatory Factor Model Fit Indices: Vaguely Quantified Reports (ordinal)
II.8  Reference Model Fit Indices: Vaguely Quantified Reports (ordinal)
II.9  Standardized Factor Loadings: Confirmatory Factor Model Comparisons
II.10 Factor Correlations: Confirmatory Factor Model Comparisons
II.11 Standardized Factor Loadings: Reference Model Comparisons
II.12 Factor Correlations: Reference Model Comparisons
III.1 Random Intercept Item Factor Model Fit Indices: Vaguely Quantified Reports (continuous)
III.2 Random Intercept Item Factor Model Fit Indices: Vaguely Quantified Reports (ordinal)
III.3 Latent Class Model Fit Indices
III.4 Latent Class and Conditional Probabilities
III.5 Factor Mixture Model Fit Indices
III.6 Class-Specific Item Threshold Estimates

I. MEASUREMENT MODELS FOR BEHAVIORAL FREQUENCIES

A. Introduction

Surveys collecting behavioral frequency information request either numerically or vaguely quantified frequency reports. Numeric frequency reports may take on a variety of forms, including a single, total frequency for a given reference period (for example, 36 times in the past year) or a rate per unit time (for example, three times a month). Furthermore, the response for either form could be open ended (that is, fill-in-the-blank) or consist of a scale of category ranges (for example, 0 times, 1-2 times, 3-5 times, 6-10 times, more than 10 times). Vaguely quantified frequency reports are collected using a scale of vague quantifiers (for example, never, sometimes, often, very often).

Regardless of the nature of the behavioral frequency report (that is, numeric or vaguely quantified), researchers are often interested in placing respondents on a latent behavioral frequency continuum based on a set of related items. To the extent that numerically and vaguely quantified reports capture the same frequency information, measurement models based on either should not differ; however, whereas numerically quantified reports reflect absolute behavioral frequency information, the literature has demonstrated that vaguely quantified frequency reports reflect relative or conditional behavioral frequency information. For example, researchers have found that the numeric definition of vague quantifiers is influenced by survey characteristics such as the perceived target population (Wanke 2002) and the visual design of the response scale (Christian 2003; Christian and Dillman 2004; Christian et al. 2008; Schwarz et al. 1991); item characteristics such as the base-rate frequency of a behavior (how frequently the behavior occurs in the population; Pepper and Prytulak 1974); and respondent characteristics such as level of engagement in the behavior (Goocher 1965), attitude toward the behavior (Goocher 1965), and demographic characteristics (Schaeffer 1991). Therefore, measurement models based on numerically and vaguely quantified frequency reports may differ in important ways.

The purpose of the current study is to determine whether the structure of the substantive factor(s) extracted from numerically quantified reports differs from the structure extracted from vaguely quantified reports. Furthermore, the study examines whether factor mixture models applied to only the vaguely quantified reports successfully reproduce the factor structure(s) extracted from the numerically quantified reports. The purpose of these analyses is to determine whether researchers can use any of the available latent variable models to extract a methodological artifact (differential interpretation) from measurement models intended to be purely substantive (see Leite and Cooper 2010).

B. Background

The decision to collect numerically or vaguely quantified frequency reports might be based on a variety of considerations, including characteristics of the respondent and behavior that influence the accessibility of behavioral frequency information in the respondent's autobiographical memory, the reference period, and the study's analytic goals.
Factors affecting the accessibility of behavioral frequency information include the respondent's level of engagement in the behavior (how frequently and regularly the respondent engages in it; Blair and Burton 1987; Burton and Blair 1991; Schwarz 1990) and characteristics of the behavior itself (how similar particular instances of the behavior are; Means and Loftus 1991; Menon 1993). How behavioral frequency information is stored in the respondent's autobiographical memory influences the retrieval strategy a respondent uses. First, consider the respondent's level of engagement in the behavior. The more frequently a respondent

engages in a particular behavior, the greater the number of instances to recall (and vice versa). When the number of instances to recall is small, each instance is more distinct in the respondent's memory than when the number of instances to recall is large. Therefore, respondents are more likely to retrieve behavioral frequency information by recalling and counting each instance (episode enumeration) when the number of instances is small than when it is large (Blair and Burton 1987). As the number of instances increases, individual instances become less distinguishable in the respondent's memory; thus, respondents are less likely to enumerate and more likely to use a rate-based or general impression strategy (Blair and Burton 1987; Burton and Blair 1991; Schwarz 1990; for a review, see Sudman et al. 1996). Finally, when the number of instances is too large to count and a rate-based strategy is not effective, respondents can use the general impression strategy. Therefore, when numeric frequency information is accessible in the respondent's memory, numeric response options can be used; however, when numeric frequency information is not readily accessible in the respondent's memory, vaguely quantified response options might be preferable.

Second, characteristics of the behavior itself (such as its regularity, similarity, and seriousness) have been shown to influence the storage of behavioral frequency information in autobiographical memory and thus respondent retrieval strategies. Specifically, as instances become more regular and similar, the representation of any particular instance becomes less distinguishable in one's memory. That is, the episodic information stored for each instance is similar, causing events to blur together. Therefore, respondents are more likely to use a rate-based strategy than episode enumeration (Means and Loftus 1991). For example, consider asking a respondent how many times she brushes her teeth. Because the respondent brushes her teeth at similar times (morning and night) in the same location (her bathroom), the episodic information is nearly identical, making the recall of any particular instance of teeth brushing difficult. Note, however, that regularity and similarity are correlated, and thus it is difficult to disentangle whether it is the regularity or similarity of instances that leads to rate-based estimation (Dykema and Schaeffer 2006; Means and Loftus 1991). Consider also the seriousness of a particular instance (Menon 1993). The more serious an instance (an emergency room visit versus a routine checkup), the more salient the instance is in memory. Furthermore, because serious instances tend to be neither regular nor similar (see Means and Loftus 1991), episodic information is easy to retrieve (as discussed previously). As a result, respondents are more likely to use episode enumeration than other retrieval strategies when recalling serious instances (Menon 1993).

Characteristics of the survey context, including the reference period, may also affect the retrieval strategy used (Blair and Burton 1987) and thus whether numerically or vaguely quantified reports are requested. As the reference period increases, the number of instances likely also increases. Therefore, when the reference period is long, episode enumeration is less likely to be used in favor of a rate-based or general impression strategy (Blair and Burton 1987).
When a rate can be calculated, numeric information may be requested; however, when only a general impression can be retrieved, vaguely quantified reports should be requested.

Research has specifically addressed the influence of retrieval strategy and characteristics of the behavior and the respondent on the accuracy of behavioral frequency reports. Among the three retrieval strategies discussed, rate-based estimation appears to result in the most accurate frequency reports (Menon 1993). Such a strategy can, however, be susceptible to overestimation if respondents forget to subtract exceptions to the rate. Generally speaking, episode enumeration is most susceptible to underestimation because when retrieving individual events, a respondent is more likely to omit an actual event than to erroneously remember a false one (Burton and Blair 1991).

Recent research has jointly considered characteristics of the behavior (for example, frequency, regularity, similarity, and seriousness) and their relationship to recall accuracy. Dykema and Schaeffer (2006) reorganized characteristics of the behavior into three categories: complexity (including frequency, regularity, and similarity); clarity (how similar one event category is to another); and emotionality (intensity and valence). In the context of child support payments, the researchers found that recall was less accurate when the nature of the child support payments was complex, indistinct from other payment types (such as alimony), and emotionally neutral. Furthermore, they found that payment characteristics were better predictors of recall accuracy than were memory decay; respondent demographic characteristics (such as age, education, and gender); social desirability; and respondent motivation.

Lastly, although Dykema and Schaeffer (2006) found that respondent demographic characteristics were not the best predictors of recall accuracy, subsequent studies have emphasized the influence of other respondent characteristics, such as respondent numeracy. The U.S. Department of Education's Institute of Education Sciences Program for the International Assessment of Adult Competencies (PIAAC) defines numeracy as "the ability to assess, use, interpret, and communicate mathematical information and ideas, to engage in and manage mathematical demands of a range of situations in adult life" (PIAAC Numeracy Expert Group 2009). In a recent study of the accuracy of reports of sexual activity, McAuliffe et al. (2010) found that discrepancies between diary and survey reports of activity increased with lower numeracy and lower education (less than high school compared with more than high school). Furthermore, as the frequency of activity increased, so too did the discrepancy between reports.

Clearly, behavioral frequency reports are complex, and even numerically quantified reports are susceptible to response errors. It is not surprising, then, that vaguely quantified reports are similarly complex. That being said, vaguely quantified response options can be appropriate in many situations, as discussed previously. For example, frequently occurring behaviors that are regular, similar, and relatively mundane or emotionally neutral are likely not individually accessible in the respondent's memory (Means and Loftus 1991; Menon 1993). Such instances can be recalled using a rate-based strategy; however, if the behavior is irregular, the respondent is likely to retrieve a general impression better represented by a vague quantifier than a numeric value. Even when events are individually accessible in the respondent's memory, researchers can still use vague quantifiers to reduce respondent burden (Schaeffer and Charng 1991).

Researchers analyzing vaguely quantified reports often make two key assumptions, which, if incorrect, have important implications. First, researchers often assume the inter-category distances between vague quantifiers are equal, and thus they use analytic strategies appropriate for interval-level data. If in fact only an ordinal level of measurement has been obtained, the structure of latent factors extracted from measurement models assuming interval-level measurement might be incorrect. Second, researchers also assume that the interpretation of vague quantifiers is identical across respondents and items.
If interpretation varies among individuals, detected differences on a factor (such as student engagement) might be due to differences in interpretation rather than true differences on the factor. Both assumptions are testable; a minimal illustration of the first follows the example item below. To better understand this phenomenon, consider the following item adapted from the National Survey of Student Engagement (2006): "In your experience at your institution during the current school year, about how often have you discussed ideas from your readings or classes with faculty members outside of class? Would you say never, sometimes, often, or very often?"
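The equal-spacing assumption can be probed directly. The sketch below, which is illustrative only and not part of the original analysis, codes hypothetical vague-quantifier responses 0-3 and compares a Pearson correlation (which treats the codes as interval level) with a Spearman rank correlation (which respects only their ordering). The study itself uses polychoric correlations, estimated in Mplus, for the ordinal treatment; those require specialized software. All variable names here are hypothetical.

    import pandas as pd

    # Hypothetical paired items coded never=0, sometimes=1, often=2, very often=3
    responses = pd.DataFrame({
        "ideas_with_faculty": [0, 1, 1, 2, 3, 2, 0, 3, 1, 2],
        "career_plans":       [0, 1, 2, 2, 3, 1, 0, 3, 2, 2],
    })

    # Interval-level treatment: Pearson correlation on the raw codes
    pearson = responses.corr(method="pearson")

    # Ordinal-level treatment (rough stand-in): Spearman rank correlation
    spearman = responses.corr(method="spearman")

    print(pearson.round(3))
    print(spearman.round(3))

If the two matrices diverge noticeably, the equal-spacing assumption behind the interval treatment is doing real analytic work rather than being an innocuous convenience.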

Research on the interpretation of vague quantifiers has demonstrated that interpretation might vary based on how often a respondent actually engages in the behavior (that is, a respondent's level of engagement). Specifically, Goocher (1965) found that participants who did not frequently engage in an activity selected a higher frequency vague quantifier for a given numeric frequency than did participants who engaged in the activity more frequently. In the context of the student engagement example, a possible hypothesis is that freshmen meet with faculty outside of class less often than do seniors because freshmen are just orienting themselves to the environment of postsecondary education, whereas seniors are more familiar with that environment. Given such a difference in actual engagement, a freshman might report that meeting with faculty outside of class four times in a semester is "often," whereas a senior might report that this same numeric frequency is only "sometimes." In fact, Nelson Laird et al. (2008) examined the interpretation of vague quantifiers among freshmen and seniors in the context of student engagement and found the difference illustrated here. The researchers have advised users of the National Survey of Student Engagement (NSSE) data to report freshman and senior engagement results separately so as not to confound group differences in behavior with group differences in interpretation.

An extensive body of literature investigates the interpretation of vague quantifiers of behavioral frequency by analyzing paired frequency reports (that is, reports that have been quantified both vaguely and numerically); however, the effects of undetected or unaccounted-for differences in interpretation on measurement models have not yet been explored. The current study addresses this gap by analyzing data collected from an experiment embedded in the 2006 NSSE.[1] We selected this data set because it contains paired behavioral frequency reports in both numerically and vaguely quantified terms.

In Chapter II, we compare the structure of the measurement model estimated from the numerically quantified frequency reports with the structure of measurement models estimated from the vaguely quantified frequency reports. A potential explanation for discrepancies between the models is the influence of individual variability in vague quantifier interpretation. Chapter III tests this hypothesis. Specifically, we consider the measurement model based on the numerically quantified data to be the reference model because it is not subject to differences in respondent interpretation. We estimate certain factor mixture models on the vaguely quantified reports to determine whether, after accounting for a differential interpretation latent factor or differential interpretation latent classes, the substantive component of the measurement model aligns with the reference model.

C. Data

In order to meet the objectives of the current study, a data set must include behavioral frequencies quantified both vaguely and numerically. Vaguely quantified behavioral frequencies can be obtained from a survey, whereas, when records of behavioral frequencies exist, such records constitute the gold standard measure of numeric frequency (that is, a respondent's true behavioral frequency). In the absence of such records, respondents' numeric reports[2] can be used as the gold standard.
Researchers at the Indiana University Center for Postsecondary Research collected such paired reports as part of an experiment embedded in the 2006 administration of the NSSE, an annual mixed-mode survey (paper and web) of students in their first and fourth years at four-year colleges and universities.

[1] NSSE data are used with permission from the Indiana University Center for Postsecondary Research.
[2] Note that numeric reports might be susceptible to response error.

The Indiana University Center for Postsecondary Research, the Indiana University Center for Survey Research, and the National Center for Higher Education Management Systems jointly implement the survey, which aims to be a nationally representative survey of undergraduate quality that can supplement college ranking instruments. We selected the NSSE data for this study because the survey is a well-established example of a questionnaire assessing behavioral frequencies using vague quantifiers. Furthermore, the 2006 experiment collected pairs of vaguely and numerically quantified responses, the ideal data for this investigation.

From January to June 2006, nearly 260,000 students from 523 colleges and universities participated in the NSSE. Note that the experiment was administered only in the survey's web version. Ultimately, 53,838 students from 144 U.S. colleges and universities participated in the experiment. For the current study, NSSE researchers shared experimental data for two-fifths of the experimental sample (10,767 students). In selecting the sample for the current study, we took care to ensure sufficient institution and within-institution sample sizes. Seventeen of the 144 institutions were not included in the current study's sample selection because they had an experimental within-institution sample size of fewer than 80. Ultimately, a random sample of students was selected from each of the 127 eligible institutions such that the proportion of observations per institution in the sample matched the proportion of observations per institution in the original experiment.

The NSSE questionnaire assessed a variety of engagement constructs, including student-faculty interaction, level of academic challenge, and enriching educational experiences. Students reported how often during the current school year they had engaged in a variety of activities using the following response options: never, sometimes, often, and very often. Of particular interest in this study are the 12 items for which paired frequency data were collected (see Table I.1). For respondents in the experimental condition, these 12 items, along with the respondents' initial vaguely quantified responses, appeared at the end of the survey.

Table I.1. Experimental NSSE Item Descriptions

Variable Name       Variable Label
Questions           Asked questions in class or contributed to class discussions
Presentation        Made a class presentation
Class Group Work    Worked with other students on projects during class
Other Group Work    Worked with classmates outside of class to prepare class assignments
Tutor               Tutored or taught other students
Community Project   Participated in a community-based project
Ideas with Others   Discussed ideas from your readings with others outside of class
Grades              Discussed grades or assignments with an instructor
Career Plans        Talked about career plans with a faculty member or advisor
Ideas with Faculty  Discussed ideas from your readings or classes with faculty members outside of class
Received Feedback   Received prompt written or oral feedback from faculty on your academic performance
Other Faculty       Worked with faculty members on activities other than coursework

Respondents numerically quantified their vaguely quantified response using a time unit of their choosing (for example, per day, week, month, academic term, or year; see Figure I.1).

Figure I.1. Example Experimental Question (from Nelson Laird et al. 2008)

A primary goal of the NSSE is to measure aspects of student engagement. To that end, previous investigations have explored the correlational structure of relevant items and have proposed various structures underlying the data. For example, in a study of the psychometric properties of the NSSE, Kuh (2002) used principal component analysis to conclude that the first 22 items on the survey (college activity items) represent four components of student engagement: student-faculty interaction, student-student interaction, diversity, and class work. Additionally, NSSE has classified 42 of the survey's items, including 12 of the 22 college activity items, into five Benchmarks of Effective Educational Practice. Specifically, their active and collaborative learning and student-faculty interaction benchmarks are composed entirely of college activity items. Benchmark scores are computed by first transforming the vaguely quantified ordinal responses to a 100-point scale (never = 0, sometimes = 33.3, often = 66.7, and very often = 100) and then averaging the transformed responses across all component items (a minimal sketch of this scoring rule appears at the end of this section). These benchmark scores are equivalent to unit-weighted factor scores in which all component items are considered to be equally good measures of the underlying construct. In a recent investigation of vague quantifier interpretation in the NSSE, Nelson Laird et al. (2008) deferred to these predetermined benchmarks.

Because previous conclusions about the underlying structure of the college activity items were based on analyses of the vaguely quantified frequency reports, it is possible that unaccounted-for differences in vague quantifier interpretation have inadvertently affected those conclusions. Therefore, the current study first explores the relationships among the 12 numerically quantified experimental college activity items to identify an underlying structure not affected by potential differences in interpretation.
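The benchmark scoring rule described above is simple enough to state in a few lines of code. The sketch below is illustrative only: the DataFrame and item names are hypothetical stand-ins for whichever items NSSE assigns to a given benchmark, and only the 0/33.3/66.7/100 mapping and the unit-weighted average come from the description above.

    import pandas as pd

    # Map vague quantifiers onto the NSSE 100-point scale
    SCALE = {"never": 0.0, "sometimes": 33.3, "often": 66.7, "very often": 100.0}

    # Hypothetical responses to two component items of one benchmark
    df = pd.DataFrame({
        "ideas_with_faculty": ["never", "often", "sometimes", "very often"],
        "career_plans":       ["sometimes", "often", "never", "very often"],
    })

    # Transform each item, then average across component items (unit weighting:
    # every item is treated as an equally good measure of the construct)
    transformed = df.apply(lambda col: col.map(SCALE))
    benchmark_score = transformed.mean(axis=1)

    print(benchmark_score)  # one benchmark score per respondent, on a 0-100 scale

Because every item gets the same weight, this is equivalent to a unit-weighted factor score; a fitted factor model would instead weight items by their loadings.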

II. INFLUENCES OF DIFFERENTIAL INTERPRETATION ON MEASUREMENT MODELS

A. Analytic Strategy

Whether measurement models based on vaguely quantified behavioral frequency reports are affected by differential interpretation is an empirical question. To that end, we compared the factor structures extracted from the numerically quantified frequency reports with those extracted from the vaguely quantified frequency reports. We began by defining a reference measurement model based on the numerically quantified reports. This model is meant to represent the true underlying factor structure of the behaviors under study. Next, we estimated measurement models based on the vaguely quantified reports under two common scenarios. The first mimicked a researcher defining a factor structure among all available items using a confirmatory approach when no validated factor has been defined in the literature. The second scenario mimicked a researcher defining a factor structure based on a factor and items that have already been validated in the literature. Furthermore, although vaguely quantified reports meet the assumptions of ordinal-level data, researchers commonly treat such data as interval level. Therefore, to make these analyses most relevant, we treated the vaguely quantified reports as both interval- and ordinal-level data in both scenarios. We assumed the numerically quantified reports (specifically, the natural log transformation of the numerically quantified reports) were continuous. Finally, we compared the reference model with the models based on the vaguely quantified reports to determine when, if ever, differences in vague quantifier interpretation result in a factor structure that differs from the reference model.

Model selection procedures were similar across analyses. First, we examined correlations among the 12 items. Using a generous cutoff, we retained items having a correlation of at least 0.3 with at least one other item for the confirmatory factor analysis. Second, we compared model fit of a two-dimensional model based on the NSSE benchmarks (Kuh 2002) with a one-dimensional model based on the same items. Third, if the NSSE benchmark model fit was unsatisfactory, we used substantive theory and local model fit statistics (for example, normalized residuals) to define an alternative two-dimensional model (often based on fewer items). Finally, we compared the model fit of the alternative two-dimensional model with a one-dimensional model based on the same items.

B. Results

1. Numerically Quantified Reports

First, consider the numerically quantified frequency reports. A measurement model based on the numerically quantified reports is not susceptible to individual differences in interpretation as is a measurement model based on the vaguely quantified reports. Therefore, the measurement model derived from these reports was considered to be the reference model to which measurement models based on the vaguely quantified reports can be compared. Before the estimation of the reference model, it was necessary to obtain item correlations controlling for any effect of unit of time. To that end, we estimated all bivariate correlations with the effect of unit of time partialled out (a minimal sketch of this screening step follows Table II.2). As Table II.1 illustrates, 5 of the 12 items were not sufficiently correlated with any other items to be included in the initial confirmatory factor model (questions, presentations, class group work, other group work, and grades).
According to the NSSE benchmarks (Kuh 2002), three of the remaining seven items were measures of active and collaborative learning, whereas the remaining four items were measures of student-faculty interaction. Therefore, we estimated a two-dimensional model based on

this structure using robust maximum likelihood estimation. Global model fit statistics indicated that model fit was acceptable, although the two resulting factors were highly correlated (r = 0.946; see Table II.2), and so we estimated a one-dimensional model on the seven items. Global model fit of the one-dimensional model was slightly better than that of the two-dimensional model. Thus, based on the model fit statistics and the principle of parsimony, we selected the one-dimensional model estimated on seven items for the numerically quantified reports, controlling for the unit of time for which the numerically quantified report was provided. This model represents the reference model (RM).

2. Vaguely Quantified Reports

Now consider the vaguely quantified frequency reports. We compared the factor structures a researcher might obtain from vaguely quantified reports with the reference model structure based on the numerically quantified reports. Any differences among the final selected models indicate inconsistencies in the measurement of either or both the numerically and vaguely quantified reports. We estimated measurement models based on the vaguely quantified reports under two common scenarios. The first mimicked a researcher examining the factor structure among all available items using a confirmatory approach (scenario one). The second mimicked a researcher examining the factor structure among only the available items identified by a reference model (that is, a previously validated model prevalent in the literature) (scenario two). Additionally, the vaguely quantified reports were treated as both interval- and ordinal-level data in both scenarios.

a. Interval Treatment

Consider first the common practice of assuming that the vaguely quantified reports meet the assumptions of interval-level measurement. As illustrated by the Pearson correlation matrix (see Table II.3), one of the 12 items was not sufficiently correlated with at least one other item (class group work) and thus was not included. Using the NSSE benchmarks as a guide, we estimated a two-dimensional model on the remaining 11 items using robust maximum likelihood estimation. Global model fit statistics indicated that model fit was acceptable, although the two resulting factors were highly correlated (r = 0.880; see Table II.4). Therefore, a one-dimensional model was estimated on the remaining items. Although global model fit was slightly worse than that of the two-dimensional model, fit was still acceptable. Therefore, based on the model fit statistics and the principle of parsimony, we selected the one-dimensional model estimated on 11 items (VQC) for the vaguely quantified reports assumed to be continuous.

Table II.1. Pearson Partial Correlation Matrix of Natural Log of Numerically Quantified Yearly Behavioral Frequencies (N = 8,174)

                           1     2     3     4     5     6     7     8     9     10    11    12
Questions (1)            1
Presentation (2)         .167  1
Class Group Work (3)     .141  .232  1
Other Group Work (4)     .166  .261  .266  1
Tutor (5)                .104  .131  .114  .195  1
Community Project (6)    .058  .186  .122  .156  .281  1
Ideas with Others (7)    .233  .169  .170  .233  .152  .073  1
Grades (8)               .186  .161  .159  .168  .110  .086  .221  1
Career Plans (9)         .184  .182  .156  .198  .220  .217  .189  .263  1
Ideas with Faculty (10)  .191  .162  .144  .191  .281  .278  .290  .217  .355  1
Received Feedback (11)   .214  .160  .177  .189  .095  .057  .323  .248  .246  .228  1
Other Faculty (12)       .102  .170  .112  .169  .333  .391  .144  .140  .305  .377  .134  1

Table II.2. Confirmatory Factor Model Fit Indices: Numerically Quantified Reports

Model         RMSEA  CFI   Factor Correlation  H0 Log Likelihood  H0 Scale Factor  # Free Parameters  AIC         BIC         saBIC
1 dimension   .057   .955                      -73,153.047        3.005            28                 146,362.09  146,558.34  146,469.36
2 dimensions  .058   .954  .946                -73,148.558        3.005            29                 146,355.12  146,558.37  146,466.21
Difference                                     2.988              3.005

Note: RMSEA ≤ .05 indicates close approximate fit, .05 to .08 reasonable error of approximation, and ≥ .10 poor fit (Browne and Cudeck 1993); CFI ≥ .90 indicates reasonably good fit (Hu and Bentler 1999). AIC = Akaike information criterion; BIC = Bayesian information criterion; CFI = comparative fit index; RMSEA = root mean square error of approximation; saBIC = sample-adjusted Bayesian information criterion.
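As referenced above, the screening step behind Table II.1 can be sketched in a few lines: partial out the unit-of-time effect, correlate the residuals, and retain items whose largest off-diagonal correlation is at least 0.3. The paper does not specify its exact computation (the models themselves were estimated in Mplus), so this is one standard reconstruction assuming ordinary least squares residualization on time-unit dummies; the function and column names are hypothetical.

    import numpy as np
    import pandas as pd

    def screen_items(df: pd.DataFrame, time_unit: pd.Series, cutoff: float = 0.3):
        """Partial correlations controlling for unit of time, then a 0.3 screen."""
        # Dummy-code the reporting time unit (day, week, month, term, year)
        dummies = pd.get_dummies(time_unit, drop_first=True).astype(float)
        X = np.column_stack([np.ones(len(df)), dummies.to_numpy()])

        # Residualize each log-frequency item on the time-unit dummies (OLS)
        beta, *_ = np.linalg.lstsq(X, df.to_numpy(), rcond=None)
        residuals = pd.DataFrame(df.to_numpy() - X @ beta, columns=df.columns)

        # Pearson correlations of the residuals = partial correlations
        partial_corr = residuals.corr()

        # Keep items correlated at least `cutoff` with at least one other item
        off_diag = partial_corr.where(~np.eye(len(df.columns), dtype=bool))
        keep = (off_diag.abs().max() >= cutoff).to_numpy()
        return partial_corr, list(partial_corr.columns[keep])

Applied to the log frequencies, a screen like this would drop the five weakly correlated items noted above; for the vaguely quantified items, the same cutoff is applied to ordinary Pearson or polychoric correlations instead.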

Table II.3. Pearson Correlation Matrix of Vaguely Quantified Yearly Behavioral Frequencies (N = 8,174)

                           1     2     3     4     5     6     7     8     9     10    11    12
Questions (1)            1
Presentation (2)         .332  1
Class Group Work (3)     .151  .270  1
Other Group Work (4)     .184  .367  .298  1
Tutor (5)                .237  .209  .120  .256  1
Community Project (6)    .152  .267  .189  .232  .296  1
Ideas with Others (7)    .291  .180  .127  .198  .211  .154  1
Grades (8)               .331  .289  .206  .282  .235  .211  .272  1
Career Plans (9)         .299  .278  .179  .264  .311  .271  .274  .477  1
Ideas with Faculty (10)  .343  .263  .184  .267  .322  .279  .377  .453  .522  1
Received Feedback (11)   .281  .236  .190  .189  .192  .188  .277  .333  .323  .348  1
Other Faculty (12)       .245  .234  .141  .262  .345  .345  .266  .301  .434  .443  .262  1

Table II.4. Confirmatory Factor Model Fit Indices: Vaguely Quantified Reports (continuous)

Model         RMSEA  CFI   Factor Correlation  H0 Log Likelihood  H0 Scale Factor  # Free Parameters  AIC         BIC         saBIC
1 dimension   .065   .912                      -106,535.90        2.078            33                 213,137.81  213,369.10  213,264.23
2 dimensions  .061   .924  .880                -106,413.26        2.056            34                 212,894.52  213,132.82  213,024.77
Difference                                     184.427            1.330

Note: RMSEA ≤ .05 indicates close approximate fit, .05 to .08 reasonable error of approximation, and ≥ .10 poor fit (Browne and Cudeck 1993); CFI ≥ .90 indicates reasonably good fit (Hu and Bentler 1999). AIC = Akaike information criterion; BIC = Bayesian information criterion; CFI = comparative fit index; RMSEA = root mean square error of approximation; saBIC = sample-adjusted Bayesian information criterion.
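The information criteria and the Difference rows in these fit tables follow standard formulas and can be checked by hand. The sketch below reproduces the Table II.4 entries from the reported H0 log likelihoods, scale factors, parameter counts, and N = 8,174; the Difference row is consistent with the Satorra-Bentler scaled chi-square difference statistic and its difference-test scaling factor, the formula documented for robust maximum likelihood in Mplus. This is a verification sketch, not code from the paper.

    import math

    N = 8174  # analysis sample size

    def information_criteria(loglik: float, k: int, n: int = N):
        """AIC, BIC, and sample-adjusted BIC from a log likelihood and k parameters."""
        aic = 2 * k - 2 * loglik
        bic = k * math.log(n) - 2 * loglik
        sabic = k * math.log((n + 2) / 24) - 2 * loglik  # sample-adjusted BIC
        return aic, bic, sabic

    def scaled_difference(ll0, c0, k0, ll1, c1, k1):
        """Satorra-Bentler scaled chi-square difference test for nested models."""
        cd = (k0 * c0 - k1 * c1) / (k0 - k1)  # difference-test scaling factor
        trd = -2 * (ll0 - ll1) / cd           # scaled difference statistic
        return trd, cd

    # Table II.4: one- versus two-dimensional model, vaguely quantified (continuous)
    print(information_criteria(-106_535.90, 33))  # ~ 213,137.8 / 213,369.1 / 213,264.2
    print(information_criteria(-106_413.26, 34))  # ~ 212,894.5 / 213,132.8 / 213,024.8
    print(scaled_difference(-106_535.90, 2.078, 33, -106_413.26, 2.056, 34))
    # ~ (184.42, 1.330), matching the table's Difference row

The same arithmetic reproduces the Difference rows of Tables II.2 and II.5 (2.988 with scaling factor 3.005, and 6.382 with 1.224, respectively), which is why those rows carry two numbers each.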

Table II.5. Reference Model Fit Indices: Vaguely Quantified Reports (continuous)

Model         RMSEA  CFI   Factor Correlation  H0 Log Likelihood  H0 Scale Factor  # Free Parameters  AIC         BIC         saBIC
1 dimension   .056   .962                      -68,701.386        2.016            21                 137,444.77  137,591.96  137,525.22
2 dimensions  .058   .963  .960                -68,697.480        1.980            22                 137,438.96  137,593.15  137,523.24
Difference                                     6.382              1.224

Note: RMSEA ≤ .05 indicates close approximate fit, .05 to .08 reasonable error of approximation, and ≥ .10 poor fit (Browne and Cudeck 1993); CFI ≥ .90 indicates reasonably good fit (Hu and Bentler 1999). AIC = Akaike information criterion; BIC = Bayesian information criterion; CFI = comparative fit index; RMSEA = root mean square error of approximation; saBIC = sample-adjusted Bayesian information criterion.

Table II.6. Polychoric Correlation Matrix of Vaguely Quantified Yearly Behavioral Frequencies (N = 8,174)

                           1     2     3     4     5     6     7     8     9     10    11    12
Questions (1)
Presentation (2)         .390
Class Group Work (3)     .178  .308
Other Group Work (4)     .219  .420  .338
Tutor (5)                .293  .290  .139  .310
Community Project (6)    .195  .335  .231  .284  .364
Ideas with Others (7)    .345  .209  .145  .231  .257  .186
Grades (8)               .392  .336  .238  .325  .280  .260  .318
Career Plans (9)         .352  .318  .204  .303  .363  .332  .317  .542
Ideas with Faculty (10)  .414  .309  .209  .314  .386  .340  .442  .528  .594
Received Feedback (11)   .334  .275  .219  .220  .230  .231  .323  .388  .372  .406
Other Faculty (12)       .306  .284  .166  .319  .417  .432  .319  .362  .509  .522  .316

Alternatively, we can analyze the vaguely quantified reports with respect to only the reference model items; doing so makes substantive factor definitions directly comparable. Recall that the reference model was a one-dimensional model based on 7 of the original 12 items (engagement: tutor, community project, ideas with others, career plans, ideas with faculty, feedback, and other faculty). As illustrated by the Pearson correlation matrix (see Table II.3), these seven items were sufficiently correlated with at least one other item. The one-dimensional reference model estimated on the vaguely quantified reports had acceptable model fit (see Table II.5). Although a two-dimensional model based on the NSSE benchmarks also had acceptable global model fit, the resulting factors were highly correlated (r = 0.960). Therefore, imposing the factor structure of the reference model on the vaguely quantified reports treated as interval-level data (VQC-RM) resulted in acceptable model fit.

b. Ordinal Treatment

Now consider the less common practice of treating the vaguely quantified reports as ordinal-level data. As illustrated by the polychoric correlation matrix (see Table II.6), all 12 items were sufficiently correlated with at least 2 others. Using the NSSE benchmarks as a guide, we estimated a two-dimensional model on all 12 items using weighted least squares. Global model fit statistics indicated that model fit was acceptable; however, the two resulting factors were highly correlated (r = 0.873; see Table II.7). Estimation of a one-dimensional model on these same items resulted in acceptable global model fit. Therefore, based on the model fit statistics, we selected the one-dimensional model estimated on all 12 items (VQO) for the vaguely quantified reports assumed to be ordinal.

Alternatively, we analyzed the vaguely quantified reports with respect to only the reference model items. According to the polychoric correlation matrix estimated from the vaguely quantified reports, these 7 items were sufficiently correlated with at least one other item (see Table II.6). The one-dimensional reference model estimated on the vaguely quantified reports had acceptable model fit (see Table II.8). A two-dimensional model based on the NSSE benchmarks had slightly worse global model fit, and the resulting factors were highly correlated (r = 0.993). Therefore, imposing the factor structure of the reference model on the vaguely quantified reports treated as ordinal-level data resulted in acceptable model fit (VQO-RM).

C. Model Comparisons

Having defined a variety of measurement models under a variety of assumptions, we can now examine differences among them. Recall that the measurement model estimation considered two scenarios. In the first, a researcher used a confirmatory approach to reduce a set of items based on theory and empirical evidence. In the second, a researcher used and sought to replicate a reference measurement model already validated in the literature. In the current study, we considered both estimation scenarios and compared models based on numerically and vaguely quantified behavioral frequency reports. Differences in estimated factor structures indicate inconsistencies in the measurement of either or both the numerically and vaguely quantified reports. One potential source of such an inconsistency is the differential interpretation of vague quantifiers. A second source might be the unreliability of numerically quantified reports.

Table II.7. Confirmatory Factor Model Fit Indices: Vaguely Quantified Reports (ordinal)

Model         RMSEA  CFI   Factor Correlation
1 dimension   .064   .935
2 dimensions  .060   .944  .873
Difference: chi-square = 237.851, difference in # free parameters = 1

Note: RMSEA ≤ .05 indicates close approximate fit, .05 to .08 reasonable error of approximation, and ≥ .10 poor fit (Browne and Cudeck 1993); CFI ≥ .90 indicates reasonably good fit (Hu and Bentler 1999). Chi-square difference obtained using the DIFFTEST option in Mplus. CFI = comparative fit index; RMSEA = root mean square error of approximation.

Table II.8. Reference Model Fit Indices: Vaguely Quantified Reports (ordinal)

Model         RMSEA  CFI   Factor Correlation
1 dimension   .059   .975
2 dimensions  .062   .974  .993
Difference: chi-square = 1.233, difference in # free parameters = 1

Note: RMSEA ≤ .05 indicates close approximate fit, .05 to .08 reasonable error of approximation, and ≥ .10 poor fit (Browne and Cudeck 1993); CFI ≥ .90 indicates reasonably good fit (Hu and Bentler 1999). Chi-square difference obtained using the DIFFTEST option in Mplus. CFI = comparative fit index; RMSEA = root mean square error of approximation.

1. Scenario One

First, consider the scenario in which a confirmatory approach was used to identify a measurement model when no validated measurement model exists. As illustrated in Table II.9, the three models resulted in different factor structures varying in which and how many items were included. Of the 12 items, 7 were included in all three models (tutor, community project, ideas with others, career plans, ideas with faculty, feedback, and other faculty), indicating some consistency in the construct(s) being measured.

To better understand the relationship between the reference model and the models based on the vaguely quantified data, correlations among the factors were obtained either by simultaneously estimating both sets of models or by estimating the models separately and then correlating predicted factor scores (because of estimation difficulties in the simultaneous models; see Table II.10). Recall that the reference model was a one-dimensional model measuring student engagement. The factors from both vaguely quantified models had high correlations with the student engagement factor from the reference model. Specifically, these factors explained between 50 and 66 percent of the same variance in behavioral frequency reports. The vaguely quantified reports treated as ordinal-level data had the highest correlation with the reference model.

Table II.9. Standardized Factor Loadings: Confirmatory Factor Model Comparisons

                    Standardized Factor Loadings (SE)
Item                RM Engagement  VQC Engagement  VQO Engagement
Questions                          .492 (.011)     .550 (.010)
Presentations                      .460 (.011)     .524 (.010)
Class Group Work                                   .369 (.011)
Other Group Work                   .439 (.012)     .512 (.011)
Tutor               .326 (.014)    .474 (.012)     .524 (.012)
Community Project   .381 (.016)    .429 (.015)     .494 (.014)
Ideas with Others   .191 (.014)    .466 (.013)     .503 (.011)
Grades                             .618 (.008)     .668 (.008)
Career Plans        .330 (.013)    .686 (.009)     .725 (.008)
Ideas with Faculty  .380 (.014)    .715 (.008)     .767 (.007)
Feedback            .182 (.015)    .489 (.009)     .532 (.009)
Other Faculty       .421 (.015)    .595 (.011)     .653 (.009)

Note: SE = standard error; RM = reference model; VQC = vaguely quantified continuous; VQO = vaguely quantified ordinal.

Table II.10. Factor Correlations: Confirmatory Factor Model Comparisons

Model  Correlation with RM  SE
VQC    .706
VQO    .811

Note: SE = standard error; RM = reference model; VQC = vaguely quantified continuous; VQO = vaguely quantified ordinal.
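The shared-variance figures quoted in Scenario One follow directly from squaring the factor correlations in Table II.10:

\[
r_{\mathrm{VQC},\mathrm{RM}}^{2} = 0.706^{2} \approx 0.50,
\qquad
r_{\mathrm{VQO},\mathrm{RM}}^{2} = 0.811^{2} \approx 0.66 ,
\]

so the interval and ordinal treatments share roughly 50 and 66 percent of their variance, respectively, with the reference engagement factor.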

2. Scenario Two

Next, consider the scenario in which a researcher seeks to replicate a validated measurement model presented in the literature. As illustrated in Table II.11, both models based on the vaguely quantified data successfully reproduced the desired reference model. The correlations between the factor estimated on the vaguely quantified reports treated as either interval- or ordinal-level data and the factor estimated by the reference model were approximately 0.53 (see Table II.12), indicating that the models explained approximately 28 percent of the same variance in behavioral frequency reports.

Table II.11. Standardized Factor Loadings: Reference Model Comparisons

                    Standardized Factor Loadings (SE)
Item                RM Engagement  VQC-RM Engagement  VQO-RM Engagement
Tutor               .326 (.014)    .481 (.013)        .535 (.012)
Community Project   .381 (.016)    .435 (.015)        .497 (.014)
Ideas with Others   .191 (.014)    .458 (.014)        .503 (.012)
Career Plans        .330 (.013)    .681 (.009)        .728 (.008)
Ideas with Faculty  .380 (.014)    .731 (.008)        .787 (.007)
Feedback            .182 (.015)    .463 (.009)        .507 (.009)
Other Faculty       .421 (.016)    .634 (.011)        .697 (.010)

Note: SE = standard error; RM = reference model; VQC = vaguely quantified continuous; VQO = vaguely quantified ordinal.

Table II.12. Factor Correlations: Reference Model Comparisons

Model   Correlation with RM  SE
VQC-RM  .524
VQO-RM  .536

Note: SE = standard error; RM = reference model; VQC = vaguely quantified continuous; VQO = vaguely quantified ordinal.

D. Discussion

This chapter demonstrated that measurement models based on vaguely quantified frequency reports do not always replicate measurement models based on numerically quantified data. Furthermore, even models based on the same vaguely quantified frequency reports differ depending on whether the data are treated as interval or ordinal level. Therefore, substantive latent measures of behavioral frequency depend on both the quantification of behavioral frequency reports (that is, numeric or vague) and the assumptions made about the level of measurement obtained.

III. DETECTION OF DIFFERENTIAL INTERPRETATION BY MEASUREMENT MODELS

One potential explanation for differences between measurement models based on numerically and vaguely quantified frequency reports is individual variability in the interpretation of vague quantifiers. The response-style literature has presented various approaches to account for individual differences in the use of response scales when estimating latent variable models. The proposed methods vary depending on whether direct measures of the response style (for example, social desirability) are available and how the response style is conceptualized. For example, Maydeu-Olivares and Coffman (2006) illustrated the use of the random intercept item factor model to account for response-style differences when no direct measure of response style was available and when the response style was conceptualized as a single continuum on which all respondents were located. Conversely, Leite and Cooper (2010) illustrated the use of a factor mixture model to account for socially desirable responding when a direct measure of susceptibility to social desirability was available and when the response style was conceptualized as affecting only one of two latent classes of respondents. The purpose of this chapter is to determine whether the factor structure of the vaguely quantified frequency reports more closely resembles the reference model factor structure when models account for individual variability in vague quantifier interpretation.

A. Analytic Strategy

We address whether latent variable models can be used to detect differential interpretation by identifying a continuous latent interpretation factor or categorical latent interpretation classes. First, we consider a random intercept item factor model in which differences in vague quantifier interpretation are modeled to occur along a continuum. Next, we consider a factor mixture model in which differences in interpretation are modeled to occur as categorical latent interpretation classes. The purpose of these analyses is to determine whether a substantive researcher can use any of the available latent variable models to extract a methodological artifact (differential interpretation) from measurement models intended to be purely substantive (see Leite and Cooper 2010).

The random intercept item factor model and a factor mixture model are estimated on the vaguely quantified frequency reports. We begin by fitting the reference model factor structure to the vaguely quantified frequency reports. Next, we add either a random intercept or a latent class variable to the model. Successful detection of differences in vague quantifier interpretation is confirmed by improved model fit and interpretability of the model.

B. Results

1. Random Intercept Factor Model

First, consider the random intercept item factor model. Before inclusion of a random intercept, the reference model estimated on the vaguely quantified reports treated as interval-level data had acceptable global model fit (see Table III.1). As illustrated in Table III.1, inclusion of a random intercept in this model resulted in deteriorated model fit. Similarly, the reference model estimated on the vaguely quantified reports treated as ordinal-level data had acceptable global model fit; as illustrated in Table III.2, inclusion of a random intercept in this model also resulted in deteriorated model fit.
Therefore, regardless of whether the vaguely quantified reports were treated as interval-level or ordinal-level data, the random intercept did not improve model fit.

Table III.1. Random Intercept Item Factor Model Fit Indices: Vaguely Quantified Reports (continuous)

Model             RMSEA  CFI   H0 Log Likelihood  H0 Scale Factor  # Free Parameters  AIC         BIC         saBIC
Original          .056   .962  -68,701.386        2.016            21                 137,444.77  137,591.96  137,525.22
Random intercept  .084   .931  -68,938.876        2.191            22                 137,921.75  138,075.94  138,006.03
Difference                     80.972             5.866

Note: RMSEA ≤ .05 indicates close approximate fit, .05 to .08 reasonable error of approximation, and ≥ .10 poor fit (Browne and Cudeck 1993); CFI ≥ .90 indicates reasonably good fit (Hu and Bentler 1999). AIC = Akaike information criterion; BIC = Bayesian information criterion; CFI = comparative fit index; RMSEA = root mean square error of approximation; saBIC = sample-adjusted Bayesian information criterion.
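The deterioration reported above can be read directly from the information criteria in Table III.1; the random intercept model is worse on every criterion despite the added parameter:

\[
\Delta \mathrm{AIC} = 137{,}921.75 - 137{,}444.77 = 476.98,
\qquad
\Delta \mathrm{BIC} = 138{,}075.94 - 137{,}591.96 = 483.98 ,
\]

both large and positive, so the original model without the random intercept is preferred.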

Table III.2. Random Intercept Item Factor Model Fit Indices: Vaguely Quantified Reports (ordinal)

Model             RMSEA  CFI
Original          .059   .975
Random intercept  .072   .970

Note: RMSEA ≤ .05 indicates close approximate fit, .05 to .08 reasonable error of approximation, and ≥ .10 poor fit (Browne and Cudeck 1993); CFI ≥ .90 indicates reasonably good fit (Hu and Bentler 1999). Chi-square difference obtained using the DIFFTEST option in Mplus. CFI = comparative fit index; RMSEA = root mean square error of approximation.

2. Factor Mixture Model

Second, consider the factor mixture model. Clark (2010) outlined the following steps for estimating such a model:

1. Estimate a latent class analysis model using the vaguely quantified reports. Identify the number of classes (T) from the best-fitting model using the Bayesian information criterion (BIC), Lo-Mendell-Rubin (LMR) p-value, bootstrapped likelihood ratio test (BLRT), and the substantive interpretation of the classes.

2. Estimate a factor mixture model based on the reference model factor structure. Increase the number of classes until a best-fitting model is obtained or T classes are estimated.

Unfortunately, the BLRT has not yet been developed for use with complex survey data, and thus model fit decisions will be based on the LMR p-value and the BIC (a minimal class-enumeration sketch appears at the end of this section).

First, we estimated a series of latent class models on 7 of the original 12 items making up the reference model (tutor, community project, ideas with others, career plans, ideas with faculty, feedback, and other faculty). As illustrated in Table III.3, the three-class model fit the data best. Furthermore, after closely inspecting the conditional probabilities estimated by both models, we selected the three-class model as the best-fitting model because it resulted in the most consistently interpretable classes. Specifically, the three-class model classified respondents as high-, medium-, or low-engagement students, as illustrated in Table III.4.
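The class-enumeration logic in step 1 can be illustrated with generic mixture-modeling tools. The sketch below uses scikit-learn's GaussianMixture purely as a stand-in (the paper's latent class models for ordinal survey items were estimated in Mplus, and the LMR test has no scikit-learn equivalent); it shows only the BIC-based comparison across candidate numbers of classes, and all data are simulated.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)

    # Simulated stand-in for 7 engagement items (rows = respondents)
    low  = rng.normal(0.5, 0.5, size=(400, 7))
    high = rng.normal(2.5, 0.5, size=(400, 7))
    X = np.vstack([low, high])

    # Step 1 (Clark 2010): fit 1..T-class models and compare BIC (lower is better)
    for n_classes in range(1, 5):
        model = GaussianMixture(n_components=n_classes, random_state=0).fit(X)
        print(n_classes, "classes: BIC =", round(model.bic(X), 1))

With two simulated groups, the BIC should bottom out at two classes; on the NSSE items, the same comparison, together with the LMR p-value and the interpretability of the conditional probabilities, favored the three-class (high-, medium-, low-engagement) solution reported in Tables III.3 and III.4.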