A Rasch Analysis of the Statistical Anxiety Rating Scale


University of Wyoming
From the SelectedWorks of Eric D. Teman, J.D., Ph.D.

A Rasch Analysis of the Statistical Anxiety Rating Scale
Eric D. Teman, Ph.D., University of Wyoming

JOURNAL OF APPLIED MEASUREMENT, 14(4), Copyright 2013

A Rasch Analysis of the Statistical Anxiety Rating Scale

Eric D. Teman
University of Northern Colorado

The conceptualization of a distinct construct known as statistics anxiety has led to the development of numerous rating scales, including the Statistical Anxiety Rating Scale (STARS), designed to assess levels of statistics anxiety. In the current study, the STARS was administered to a sample of 423 undergraduate and graduate students from a midsized, western United States university. The Rasch measurement rating scale model was used to analyze scores from the STARS. Misfitting items were removed from the analysis. In general, items from the six subscales represented a broad range of abilities, with the major exception being a lack of items at the lower extremes of the subscales. Additionally, a differential item functioning (DIF) analysis was performed across sex and student classification. Several items displayed DIF, which indicates subgroups may ascribe different meanings to those items. The paper concludes with several recommendations for researchers considering using the STARS.

Requests for reprints should be sent to Eric D. Teman, University of Northern Colorado, Applied Statistics and Research Methods, McKee 518, Campus Box 124, Greeley, CO 80639, USA, eric.teman@unco.edu.

Statistics anxiety exists as a construct separate from but related to mathematics anxiety (Baloğlu, 2004). Those who experience statistics anxiety are not an isolated few; rather, research shows that statistics anxiety is a widespread phenomenon (Onwuegbuzie, 2004). The issue of statistics anxiety has been researched for several decades (Earley and Mertler, 2002). In fact, numerous instruments have been developed with the sole purpose of measuring levels of statistics anxiety. One of the most popular of such instruments is the Statistical Anxiety Rating Scale (STARS; Cruise, Cash, and Bolton, 1985), shown in Appendices A and B. In the original psychometric research completed on the STARS by Cruise et al. (1985), six components of statistics anxiety were identified using principal components analysis: worth of statistics, interpretation anxiety, test and class anxiety, computational self-concept, fear of asking for help, and fear of statistics teachers. Though much research on the psychometric properties of the STARS is present in the literature, such research is limited to applications of classical test theory (CTT), exploratory factor analysis (EFA), and principal components analysis (PCA). There is currently no research applying the Rasch model to scores from the STARS. Research applying Rasch measurement techniques is warranted to address some of the limitations of the true score model approach to item analysis. Specifically, the Rasch model was used in the current study to facilitate a more in-depth item-level analysis of scores from the STARS. Such analyses were accomplished by examining fit statistics, which allow for the assessment of the validity of the scores on the overall measure. Misfitting items were identified in the analyses and removed from the model. The goal was to retain only the items that functioned well in identifying the underlying traits encompassing statistics anxiety.
In addition to assessing fit, Rasch modeling is useful in examining participants' use of the response scale (e.g., strongly disagree to strongly agree). A 5-point Likert-type rating scale was used in the current study. In such a rating scale analysis, researchers hope to see the formation of a continuum where, at one extreme, are the participants who have less of the trait and, at the other extreme, are the participants who have more of the trait (Green, 2002). In other words, those who indicated a 1 (strongly disagree) on an item should possess less statistics anxiety than someone who marked a 2. Finally, differential item functioning (DIF) can be examined for its effects on the validity of the scores obtained from a measure. One of the assumptions in Rasch measurement is parameter invariance (Fischer and Molenaar, 1995). Essentially, invariance "describes the scope of use properties of a measure" (Green, 2002, p. 4). When parameter invariance fails across subpopulations in a sample, the uses of the measurement instrument are limited. In the current study, a DIF analysis was conducted using the Mantel (1963) procedure for polytomous items. For each of the items that displayed DIF, for one or both of the subpopulation pairs of sex and student classification, the implications are discussed.

Statistics Anxiety Literature Review

Prior estimates have shown that statistics anxiety is experienced at uncomfortable levels by 80% of graduate students (Onwuegbuzie, 2004). Of notable concern is that performance in a statistics class and statistics anxiety levels are inversely related (Fitzgerald and Jurs, 1996; Onwuegbuzie and Seaman, 1995; Zeidner, 1991). Students should be able to readily interpret statistical findings in journal articles and other scholarly publications they read, but statistics anxiety inhibits this ability and the ability to apply statistical techniques to real-world situations (Birenbaum and Eylath, 1994).
Statistics anxiety level is negatively correlated with skills acquisition (Mji, 2009). That is, students who experience high levels of statistics anxiety might not grasp the statistics skills they are being taught as well as students who do not experience this type of anxiety. Another concern is that students having high statistics anxiety wait until the end of their academic careers before enrolling in their required statistics class (Onwuegbuzie, DaRos, and Ryan, 1997), thereby missing the opportunity to actively apply statistical research skills during their academic career. Research has shown that factors potentially influencing statistics anxiety include sex, age, ethnicity, academic major, and previous mathematics experiences (Baloğlu, 2003; Onwuegbuzie and Wilson, 2003). Of these factors, sex seems to be the most widely studied. Some prior research has indicated that women experience higher levels of statistics anxiety (Onwuegbuzie, 1993; Royse and Romph, 1992; Zeidner, 1991), but other research has indicated no significant differences exist between sexes (Baloğlu, 2003). In terms of student classification, there is some support that statistics anxiety levels are higher among graduate students than among undergraduate students (Benson and Bandalos, 1989; Harvey, Plake, and Wise, 1985). On the other hand, Benson (1989) discovered that statistics anxiety did not differ significantly between undergraduate and graduate students.

Dimensionality of Statistics Anxiety

There is much support in the literature indicating that statistics anxiety comprises six dimensions. Baloğlu (2002) was the first to confirm the six-factor model of the STARS, using a sample of 221 undergraduate college students enrolled in a basic statistics course. Hanna, Shevlin, and Dempster (2008) ran confirmatory factor analyses testing a one-factor model, a four-factor model, and a six-factor model of the STARS with 849 undergraduate psychology students in the United Kingdom. Descriptive fit indices, including the root mean square error of approximation (RMSEA), comparative fit index (CFI), and standardized root mean square residual (SRMR), indicated the one-factor model fit the data poorly.
Both the four- and six-factor models exhibited reasonable model fit, with the six-factor model fitting the data best.

Rasch Measurement

Researchers in various fields of study utilize surveys as a means to collect data from respondents on their feelings or attitudes toward some construct. Typically, these surveys consist of rating scales where, for example, individuals respond to items on an agreement scale designed to measure attitude toward or agreement with some construct. The rating scale can consist of many ordered categories, such as a rating scale with five response options, from strongly disagree, at one extreme, to strongly agree, at the opposite extreme. This particular method of data collection results in polytomous data. A traditional approach to assessing reliability of scores from rating scales often includes reliability and item analysis based on CTT. EFA is a common technique used in conjunction with CTT to help assess the validity of scores from the rating scale. A PCA was used by the authors of the original STARS instrument, where a six-component model was interpreted (Cruise et al., 1985). Under a CTT approach, item-total statistics can be examined to determine which (if any) items should be deleted from the instrument. For example, the Cronbach's-alpha-if-item-deleted value, as reported in SPSS, can be examined to see if there would be an appreciable increase in internal consistency reliability if a particular item were removed from the instrument. Other statistics, such as inter-item correlations and item-total correlations, can be examined as well. The reasons for performing a Rasch analysis on the STARS data were threefold. First, it was used to create a scale that fits the Rasch model, which is accomplished by deleting poorly fitting items from the model and reassessing model fit. One advantage of having data that fit the Rasch model well relates to the invariance property of the model.
Bond and Fox (2007) stated that "it is a prima facie requirement of measurement outside the social sciences that values attributed to variables by any measurement system should be independent of the particular measurement instrument used" (p. 69). In other words, in the realm of the social sciences, when we seek to measure constructs, our measurement instrument should not affect what we are measuring. The Rasch model is the only model that claims to be truly invariant if the data fit the model (Bond and Fox, 2007). Therefore, it is desirable to have data that meet the requirements of the Rasch model. Second, Rasch analysis was used to assess the quality of the 5-point rating scales for each of the six subscales. Third, and perhaps most importantly, Rasch analysis was used to determine which items were tapping into the varying levels of each of the six dimensions of statistics anxiety. When measuring any trait or construct, it is desirable to have items that are operating at each of the different ability levels (e.g., high, medium, low). This issue is discussed for each of the six subscales, and recommendations are provided on how to address the problems identified. Another related issue that arises with nearly any rating scale is whether the individual items (or the entire rating scale) function differently across different subpopulations. Though this issue is not unique to Rasch analysis, it is addressed within the context of a Rasch analysis in the current study. In the context of Rasch modeling, DIF is assessed on an item-by-item basis (e.g., de Ayala, 2009; Bond and Fox, 2007; Glas and Verhelst, 1995). One advantage of using the Rasch model over CTT is the comprehensiveness of the item-level analysis. Goodness-of-fit criteria are used to identify items that do not adequately fit the Rasch model. The deletion of these items from the model will improve model fit (Bond and Fox, 2007).
In addition, a representative sample of the examinee population is unnecessary because of the invariance property of the Rasch model, which essentially means that "parameters that characterize an examinee are independent of the test items from which they are calibrated and those parameters that characterize an item are independent of the ability distribution of the set of examinees" (Hambleton and Jones, 1993, p. 258). In CTT, however, a representative sample is needed because the results of any one CTT analysis are sample-specific. In both CTT and Rasch measurement, an adequate, heterogeneous sample is needed for the analysis to be valid (Hambleton and Jones, 1993). In addition to the item-level information produced in a Rasch analysis, the researcher is able to assess the quality of the rating scale. For example, for a 5-point rating scale, there are diagnostic procedures that can be used to assess how well the rating scale actually worked. This provides information on whether adjacent categories might need to be collapsed to help eliminate noise in the rating scale (Bond and Fox, 2007). Although this can be done in a CTT framework by examining frequencies in each category and then making a subjective decision on how well the rating scale is operating, a Rasch analysis provides fit statistics and threshold calibrations that aid in the decision. Perhaps most importantly, the Rasch model provides information on how well the items on a particular rating scale tap into the various levels of the trait or construct being measured (Bond and Fox, 2007; King and Bond, 1996). If the items are only measuring persons with middle ability levels, for example, the researcher is going to obtain imprecise measurements for persons with high or low levels of the trait or construct being measured.
If, on the other hand, there are items distributed across all ability levels, the scale is doing well at precisely measuring persons at all levels of the measured trait.

Andrich Rating-scale Model

The Rasch model is a single-parameter logistic model, where the parameter is the location of the item on the continuum of the latent trait being measured (Green, 2002). The usual one-parameter model is appropriate only for dichotomous data. However, when a researcher wishes to apply the Rasch model to rating scale data with polytomous response options, a special case of the Rasch model can be applied. Specifically, the Andrich rating-scale model can be used (Andrich, 1978). The Andrich rating-scale model can be expressed as a logit-linear model:

log(P_nij / P_ni(j-1)) = B_n - D_i - F_j,   (1)

where P_nij is the probability that person n encountering item i is observed in category j, B_n is the ability measure of person n, D_i is the difficulty measure of item i, and F_j is the calibration threshold between categories j and j - 1 (Linacre, 2011). When the Andrich rating-scale model is used, Rasch analysis provides item estimates for responses to each Likert-type stem and estimates for the response thresholds, e.g., four thresholds on a 5-point rating scale (Bond and Fox, 2007; Wright and Masters, 1982). The principal advantage of estimating item thresholds is that each item can be examined for its usefulness in differentiating among respondents with differing levels of the construct of interest. That is, by examining thresholds, it becomes clear that the items do not have the same relative value in the construct being examined. For example, item 1 may work well for differentiating between higher levels of the construct, whereas item 2 may work better for differentiating between lower levels of the construct. The researcher will readily be able to discern whether the items are at the appropriate difficulty level for the group. In the context of rating scale data, difficulty acquires a different meaning: the more difficult the item, the harder it is for a person to endorse that item. It is most desirable for the items on the scale to include an appropriate range of item stems that actually tap into the range of levels of the construct for the given population. Examining item thresholds when developing a scale for a survey or test allows us to ensure there are items adequately discriminating between persons across the range of statistics anxiety levels.
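To make equation (1) concrete, the category probabilities it implies can be computed by accumulating the adjacent-category log-odds. The short sketch below is an illustration only, not part of the original analysis; the function name and the example values for B, D, and the thresholds F are hypothetical rather than estimates from the STARS data.

```python
import numpy as np

def rating_scale_probs(B, D, F):
    """Category probabilities under the Andrich rating-scale model.

    B : person ability (logits); D : item difficulty (logits);
    F : thresholds F_1..F_m between adjacent categories (logits).
    Returns probabilities for categories 0..m, where
    log(P_k / P_{k-1}) = B - D - F_k for k = 1..m, as in equation (1).
    """
    # Cumulative log-numerators: 0 for category 0, then running sums of (B - D - F_k).
    logits = np.concatenate(([0.0], np.cumsum(B - D - np.asarray(F, float))))
    probs = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return probs / probs.sum()

# Hypothetical values: a person 1 logit above the item, four thresholds (5-point scale).
p = rating_scale_probs(B=1.0, D=0.0, F=[-1.5, -0.5, 0.5, 1.5])
```

A convenient implementation check is that the probabilities sum to 1 and the log-odds of each adjacent category pair reproduce B - D - F_j exactly.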
Differential Item Functioning

When a survey or test is administered to a group of individuals, it is sometimes of interest to investigate whether persons in subpopulations (typically identifiable from a demographics section of the survey) exhibit differences in item difficulty when ability is held constant. Ideally, results of a test or survey will not be affected by subpopulation membership; however, this is typically an untenable assumption (Fischer and Molenaar, 1995). There is likely to be some group within the larger population against whom the test is biased. It is the researcher's task to identify these potential groups so that DIF analyses can be conducted prior to making any sweeping generalizations to any one population. This is not to say that every conceivable subpopulation should be examined for DIF. Rather, the researcher should begin by conducting a thorough literature review and becoming familiar with the theory underlying the test or survey (Myers, Wolfe, Feltz, and Penfield, 2006). For example, if prior research has established that one of the sexes tends to display DIF for a survey item, it is to the researcher's advantage to investigate whether DIF occurs in his or her particular study before making any generalizing conclusions. In the current study of the psychometric properties of the STARS instrument, all 51 items were rating scale items with five response categories (polytomous). A polytomous item's bias is determined by first holding ability constant and then examining the differences between groups in the probabilities of scoring in the various categories of that item (Crane, van Belle, and Larson, 2004; Fischer and Molenaar, 1995; Penfield, 2000). It is worth noting that when a subpopulation exhibits DIF in the context of a survey, the meaning is interpreted differently than when dealing with a test. If there were DIF in a test, it may be the case that one group is outperforming another group.
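The logic just described, holding ability constant and comparing the groups' item-score distributions, is what the Mantel (1963) procedure operationalizes for ordered categories. The sketch below is a simplified illustration, not the study's actual code: it matches respondents on an observed stratifying score (e.g., the rest score) as a stand-in for ability, and the function name is hypothetical.

```python
import numpy as np

def mantel_polytomous_dif(item, group, strata):
    """Mantel (1963) chi-square statistic for DIF on a polytomous item.

    item   : observed category score for each respondent
    group  : 0 = reference group, 1 = focal group
    strata : matching score used as a proxy for ability (e.g., rest score)
    Returns a 1-df chi-square; large values suggest the groups differ in
    their item scores even after matching on the stratifying score.
    """
    item, group, strata = (np.asarray(a) for a in (item, group, strata))
    item = item.astype(float)
    F_obs = E = V = 0.0
    for s in np.unique(strata):
        y = item[strata == s]
        g = group[strata == s]
        N = len(y)
        nF = int((g == 1).sum())
        nR = N - nF
        if N < 2 or nF == 0 or nR == 0:
            continue  # a stratum containing only one group carries no information
        T1, T2 = y.sum(), (y ** 2).sum()
        F_obs += y[g == 1].sum()  # observed focal-group score total
        E += nF * T1 / N          # its expectation under no DIF
        V += nF * nR * (N * T2 - T1 ** 2) / (N ** 2 * (N - 1))
    return (F_obs - E) ** 2 / V
```

When the two groups' score distributions are identical within every stratum, the statistic is zero; it grows as one group's scores shift relative to the other's after matching.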
In a survey, however, it is more meaningful to say that one group more easily endorses an item than the other group. As stated above, a DIF analysis is generally performed to compare estimates across two or more groups, such as males and females or graduate and undergraduate students, to determine whether items significantly differ in meaning across the distinct groups (Bond and Fox, 2007). There is no consensus on whether males and females differ in terms of their statistics anxiety levels (e.g., Benson, 1989; Benson and Bandalos, 1989; Harvey et al., 1985; Onwuegbuzie, 1993; Onwuegbuzie et al., 2000; Royse and Romph,

1992). Although research has been conducted on statistics anxiety levels across different age groups, finding that older students typically experience higher levels of statistics anxiety (e.g., Benson and Bandalos, 1989; Harvey et al., 1985), prior research is not conclusive on whether undergraduate and graduate students differ in their level of statistics anxiety (e.g., de Ayala, 2009; Bell, 2003). As a result of this gap in the literature, differences in statistics anxiety among undergraduate and graduate students were of interest in the current study. The research on statistics anxiety is fairly extensive, but there are weaknesses in the prior research. With the STARS being one of the most popular instruments for measuring statistics anxiety, there is a lack of psychometric research indicating whether the STARS measures statistics anxiety equivalently for all subpopulations. Thus, it appears that previous studies have been published under the assumption that the STARS measures statistics anxiety equivalently for all students. This assumption may be incorrect. Therefore, the rationale for the DIF analysis is to determine whether the STARS items function equivalently across different groups of students that are typically compared in the literature. The DIF analyses undertaken in the current study involve subpopulations that have been compared in prior research. In those subpopulations, anxiety levels were found either to differ significantly between groups or not, sometimes with conflicting results depending on the population under study. However, the results from these prior studies could be weakened if DIF were present.

Method

Participants

Students at a midsized university in the western part of the United States were recruited to complete the STARS instrument and a demographics questionnaire. There were 431 participants (294 women, M age = 26.58; 136 men, M age = 24.94).
Both undergraduate students (127 freshmen, 64 sophomores, 51 juniors, 23 seniors) and graduate students (53 master's, 113 doctoral) were recruited to take the survey. Students across various academic disciplines were represented in the sample, including behavioral sciences (n = 95), business (n = 14), education (n = 42), health sciences (n = 80), humanities (n = 10), mathematics (n = 32), natural sciences (n = 31), nursing (n = 82), performing and visual arts (n = 6), social sciences (n = 31), and other (n = 8). Eight participants were removed from the analysis because they did not provide adequate demographic information. The majority of undergraduate students completing the STARS were from my introductory statistics classes as well as my colleagues' introductory statistics classes. The remaining undergraduate students were from introductory psychology classes. All of the graduate students were enrolled in a master's or doctoral program within the college of education, in majors such as statistics, educational psychology, counseling psychology, special education, and educational leadership. The response rate was approximately 80 percent.

Instrumentation

The STARS instrument comprised 51 items, each rated on a 5-point Likert-type scale. For the first 23 items (see Appendix A), participants were asked to rate their anxiety level (from no anxiety to strong anxiety) for each of the conditions. For the next 28 items (see Appendix B), participants were asked to rate their level of agreement (from strongly agree to strongly disagree) with each statement. On items 1-23, higher scores on each item correspond to higher anxiety levels, and on items 24-51, higher scores on each item correspond to more positive attitudes. A series of demographic questions was included at the end of the survey, inquiring about sex, age, academic major, and student classification (e.g., freshman, sophomore).
Procedure

Prior to data collection, an application for exempt review was submitted to and approved by the institutional review board at the university where the study was conducted. All participants were given a consent form. Students were given the option to complete a paper version of the survey or to complete it online. The online survey format was also used as a follow-up method for collecting data. In the end, 50 participants out of 423 completed the paper-based survey. A $20 cash prize drawing was used as an incentive to increase the response rate. The entry form was attached at the end of the survey packet for those who took the paper version. If a participant was interested in entering the drawing, he or she would detach the entry form page and submit it separately from the survey. For those students completing the online version, a separate URL was given where they could enter the $20 drawing so their survey was not linked to the prize drawing.

Results

Dimensionality of the STARS

The six-factor structure of the STARS instrument has been established across a number of studies (e.g., Baloğlu, 2002; Cruise et al., 1985; Hanna et al., 2008); therefore, before subjecting the data to a Rasch analysis, a confirmatory factor analysis (CFA) was conducted to determine whether the six-factor model was supported by the current data. A six-factor CFA model was specified and estimated using Mplus (version 6.1; Muthén and Muthén, 2010). The weighted least squares mean and variance adjusted (WLSMV) estimator was used, as it takes into account the ordinal nature of the data when producing parameter estimates. Global model fit and component fit were assessed using several measures of fit, including the chi-square test based on the WLSMV estimator (Muthén and Muthén, 2010), root mean square error of approximation (RMSEA; Steiger, 1990), Tucker-Lewis index (TLI; Tucker and Lewis, 1973), and the comparative fit index (CFI; Bentler, 1990). The six-factor model in the current study fit well.
The chi-square test statistic, χ²(1209, N = 423), was statistically significant (p < .001), but this is to be expected with such a large sample size and is not necessarily indicative of poor model fit (Kline, 2011). In fact, the descriptive fit indices in the Mplus output provided evidence against rejection of the six-factor model. The RMSEA (a "badness-of-fit" index) was .059, where 0 indicates perfect fit. The CFI and TLI were both .94, where values close to 1 indicate good fit. These fit statistics were compared to Hu and Bentler's (1999) recommendations for cutoff criteria. In addition, all standardized factor loadings were statistically significant (p < .001) and indicative of good component fit. Because of the six-factor structure of the STARS, the item analysis was conducted separately for each of the six dimensions.

Rasch Item Analysis

On the anxiety scale, the components identified from the literature were test and class anxiety, interpretation anxiety, and fear of asking for help. On the agreement scale, the components identified were worth of statistics, computational self-concept, and fear of statistics teacher. All items and all persons on each of the six subscales were subjected to calibration using Rasch analysis. Specifically, WINSTEPS 3.71 (Linacre, 2011) was used to apply the Andrich rating-scale model (Andrich, 1978). For each of the six subscales, unidimensionality was assessed by running a Rasch principal components analysis (PCA) of the residuals within WINSTEPS. By running a Rasch PCA of residuals, researchers are searching for evidence of a component that explains a large amount of variance in the residuals. This component is known as the first contrast. Generally, to conclude the presence of unidimensionality, the first contrast should have an eigenvalue less than 2 (Linacre, 2011). Person and item separation reliability are discussed in each of the subsections below.
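As a rough illustration of the unidimensionality check just described (and not WINSTEPS's exact algorithm, which differs in detail), the first-contrast eigenvalue can be approximated as the largest eigenvalue of the inter-item correlation matrix of standardized residuals. The function name and inputs below are hypothetical.

```python
import numpy as np

def first_contrast_eigenvalue(observed, expected, variance):
    """Approximate eigenvalue of the first contrast in a PCA of Rasch residuals.

    observed, expected, variance : persons-by-items arrays of observed scores,
    model-expected scores, and model variances. An eigenvalue below about 2
    (i.e., less than two items' worth of residual variance) is read as support
    for unidimensionality (Linacre, 2011).
    """
    z = (observed - expected) / np.sqrt(variance)  # standardized residuals
    corr = np.corrcoef(z, rowvar=False)            # item-by-item residual correlations
    return float(np.linalg.eigvalsh(corr).max())
```

If a secondary dimension drives the residuals of several items in common, those items' residuals correlate and the first-contrast eigenvalue rises toward the number of items involved.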
Person separation reliability measures how accurately persons can be differentiated on the measured variable, whereas item separation reliability refers to how well the test distinguishes between items along the measured variable (Bond and Fox, 2007). These reliability coefficients can be interpreted in the same manner as Cronbach's alpha.
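For readers implementing these indices, separation reliability is the proportion of observed variance in the logit measures that is not attributable to estimation error. The sketch below is an illustration under that definition; the function name is hypothetical, and the same computation applies to person measures and item measures.

```python
import numpy as np

def separation_reliability(measures, std_errors):
    """Rasch separation reliability: estimated true variance / observed variance.

    measures   : person (or item) logit estimates
    std_errors : their standard errors of estimation
    The result is interpreted like Cronbach's alpha; the separation index
    G = sqrt(rel / (1 - rel)) expresses the same ratio as a spread in SE units.
    """
    measures = np.asarray(measures, float)
    se = np.asarray(std_errors, float)
    observed_var = measures.var(ddof=1)  # variance of the estimates
    error_var = float(np.mean(se ** 2))  # average error variance
    true_var = max(observed_var - error_var, 0.0)
    return true_var / observed_var
```

When the standard errors shrink toward zero, the coefficient approaches 1; when error variance rivals the observed spread, it falls toward 0.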

Additionally, as part of the item-level analysis, item-person maps (Wright maps) were generated for each of the six subscales of the STARS (Figures 1 to 6). These maps show the distribution of persons along the logit scale. Each of these six maps is discussed in further detail in the following subsections.

Item Misfit

Items were examined on each of the six subscales for both infit and outfit. Infit is an information-weighted sum, whereas outfit is calculated using the conventional sum of squared standardized residuals (Bond and Fox, 2007). Both fit mean squares and standardized fit statistics (ZSTD) were examined to assess how well the data fit the model. Specifically, mean square values outside the range of .6 to 1.4 and ZSTD values outside of -2 to 2 were flagged as potentially misfitting items (Bond and Fox, 2007). Items were deleted one at a time, meaning the data were recalibrated before deleting another item.

Test and class anxiety subscale. Person separation reliability for this eight-item subscale was .83 and item separation reliability was .99. None of the items met the mean square criterion for deletion, which is not unexpected given the moderate sample size of 423. However, items were also subject to removal based on standardized fit statistics, which account for sample size. In this case, item 22, "Going over a final examination in statistics after it has been marked," was removed, as it had a ZSTD of 4.1 for outfit. Item 4, "Doing the homework for a statistics course," was deleted with a ZSTD of 3.9 for outfit. Item 10, "Walking into the classroom to take a statistics test," was deleted with a ZSTD of 3.4 for outfit. With five items remaining on this subscale, 63.1% of the raw variance was explained by the Rasch dimension (34.4% by persons and 28.7% by items), with the first contrast having a value of 1.6 (11.6% of the variance unexplained).
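The two fit statistics used for flagging can be sketched directly from their definitions: outfit is the unweighted mean of squared standardized residuals, while infit weights each squared residual by its model variance (its information). The function and variable names below are hypothetical, and the ZSTD transformation applied by WINSTEPS is omitted.

```python
import numpy as np

def item_fit(observed, expected, variance):
    """Infit and outfit mean squares for one item across persons.

    observed, expected, variance : per-person observed score, model-expected
    score, and model variance for this item. Values near 1 indicate fit;
    this study flagged mean squares outside the range .6 to 1.4.
    """
    resid = np.asarray(observed, float) - np.asarray(expected, float)
    var = np.asarray(variance, float)
    outfit = float(np.mean(resid ** 2 / var))        # mean squared standardized residual
    infit = float(np.sum(resid ** 2) / np.sum(var))  # information-weighted version
    return infit, outfit
```

Because infit divides by the total variance rather than averaging ratios, a large residual on a low-variance (off-target) response inflates outfit far more than infit, which is why outfit is the more outlier-sensitive of the two.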
In the final model, person separation reliability was .77, which is a decrease from the initial model; however, the nature of the misfit warranted deletion of several items. Item separation reliability remained at .99. Figure 1 shows the item-person map for the test and class anxiety subscale. The five items on the modified subscale cover a large part of the range of statistics anxiety. However, there is a lack of items at both extremes of the distribution.

[Figure 1. Item-person map for the test and class anxiety subscale.]

Interpretation anxiety subscale. Person separation reliability for this 11-item subscale was .84 and item separation reliability was .99. All 11 items were within the .6 to 1.4 mean square fit boundaries. Item 18, "Watching a student search through a load of computer printouts from his/her research," was removed (outfit ZSTD of 4.3); the data were then recalibrated, and item 17, "Trying to understand the odds in a lottery," was removed (outfit ZSTD of 4.3). Item 9, "Reading an advertisement for an automobile which includes figures on gas mileage, compliance with pollution regulations, etc.," was removed on the basis of a large mean square infit value (1.53). Item 14, "Determining whether to reject or retain the null hypothesis," had an outfit ZSTD of 4.2 and was therefore removed. Item 12, "Arranging to have a body of data put into the computer," was removed on the same basis (outfit ZSTD of 4.1). Six remaining items comprise the modified interpretation anxiety subscale. With five items deleted from the model, 59.2% of the raw variance was explained by the Rasch dimension (35.5% by persons and 23.7% by items), with the first contrast having a value of 1.4 (9.5% of the variance unexplained). In the final model, person separation reliability was .83 and item separation reliability remained at .99. The six items comprising the interpretation anxiety subscale appear to be doing well in discriminating among persons at the middle and high end of the scale (see Figure 2). However, there are noticeable places on the item-person map where items are needed to discriminate between anxiety levels.

Fear of asking for help subscale. Person separation reliability for this subscale was .73 and item separation reliability was .92. This subscale has only four items, with all items contained within the mean square and ZSTD fit boundaries. The Rasch dimension accounts for 59.9% of the raw variance (40.2% by persons and 19.7% by items). The first contrast has a value of 2.0 (20.3% of the variance unexplained). Figure 3 shows the item-person map for the four items making up the fear of asking for help subscale. There is a complete lack of items detecting low and extremely low levels of anxiety when seeking help.
An absence of items is also noticeable toward the higher end of the anxiety scale.

[Figure 2. Item-person map for the interpretation anxiety subscale.]

[Figure 3. Item-person map for the fear of asking for help subscale.]

Worth of statistics subscale. This subscale consisted of 16 items, with person separation reliability of .89 and item separation reliability of .98. Two of the items were removed based on large mean square outfit values: item 36, "Statistics is for people who have a natural leaning toward math," and item 24, "I am a subjective person, so the objectivity of statistics is inappropriate for me." The ZSTD outfit values were 2.54 and 2.14, respectively. The remaining 14 items were within the mean square boundaries for both outfit and infit. Several of the remaining items had unacceptable ZSTD values and were removed from the model: item 47, "Statistical figures are not fit for human consumption"; item 26, "I wonder why I have to do all these things in statistics when in actual life I will never use them"; item 50, "I am never going to use statistics so why should I have to take it?"; item 29, "I feel statistics is a waste"; item 33, "I lived this long without knowing statistics, why should I learn it now?"; and item 42, "I do not see why I have to fill my head with statistics. It will have no use in my career." These items had outfit ZSTD scores of 3.8, 4.3, 4.6, 3.7, 3.6, and 3.6, respectively. Upon rerunning the model, item 41, "I do not understand why someone in my field needs statistics," and item 49, "Affective skills are so important in my (future) profession that I do not want to clutter my thinking with something as cognitive as statistics," were removed based on large infit ZSTD scores of 3.6 and 3.6, respectively. With 6 of the original 16 items remaining in the model, 63.1% of the raw variance was explained by the Rasch dimension (40% by persons and 23.1% by items). The first contrast had a value of 1.7 (10.4% unexplained variance). In the final model, person separation reliability was .82 and item separation reliability was .98. The six items on the worth of statistics subscale appear to be successfully discriminating among persons with mid to high levels of agreement (Figure 4).
However, as occurred for all of the subscales on the anxiety portion of the survey, items detecting low and extremely low levels of the trait are nonexistent.

Fear of statistics teacher subscale. Person separation reliability was .69 and item separation reliability was .99. All five items on this subscale were within the .6 to 1.4 mean square boundaries; however, item 32 ("Most statistics teachers are not human") was removed for having a large infit ZSTD value of 3.6. After rerunning the analysis, person separation reliability was .68 and item separation reliability remained at .99. The Rasch dimension explained 57.7% of the raw variance (30% by persons, 27.7% by items). The first contrast was 1.4 (15.2% of the variance unexplained). The item-person map for the statistics teacher subscale is depicted in Figure 5. The five items comprising this scale do well in discriminating among persons with mid to high levels of agreement. There are, however, no items detecting extremely low levels of agreement.

Computational self-concept subscale. Person separation reliability was .74 and item separation

reliability was .99. All 7 items were within the mean square fit boundaries, but item 48 ("Statistics is not really bad. It is just too mathematical.") was removed on the basis of having a large ZSTD outfit value of 3.3. With 6 items remaining in the model, person separation reliability was .71 and item separation reliability remained at .99. The Rasch dimension accounted for 57.1% of the raw variance (28.2% by persons, 28.9% by items). The first contrast had a value of 1.8 (12.7% of the raw variance unexplained). Figure 6 shows the item-person map for the self-concept subscale. The six items do well at discriminating at the mid to high levels of agreement; however, there are no items discriminating among the low and extremely low levels of agreement.

Figure 4. Item-person map for worth of statistics subscale.

Figure 5. Item-person map for fear of statistics teacher subscale.
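The item screening in each subscale analysis above rests on infit and outfit statistics. As a rough illustration of how those mean squares arise from standardized residuals, here is a minimal sketch in Python using the dichotomous Rasch model and simulated data; it is not the rating scale model or the Winsteps implementation used in the study, and the ZSTD values reported above are a further standardization of these mean squares. All names and the simulated parameter values are illustrative only.

```python
import numpy as np

def rasch_p(theta, b):
    """P(X = 1) under the dichotomous Rasch model, persons x items."""
    return 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))

def item_fit(X, theta, b):
    """Unweighted (outfit) and information-weighted (infit) mean squares.

    X: persons-by-items 0/1 response matrix; theta, b: person and item
    parameters (here the generating values, standing in for estimates).
    """
    P = rasch_p(theta, b)                 # model-expected responses
    V = P * (1.0 - P)                     # model variance of each response
    sq_resid = (X - P) ** 2
    outfit = (sq_resid / V).mean(axis=0)             # mean squared standardized residual
    infit = sq_resid.sum(axis=0) / V.sum(axis=0)     # variance-weighted version
    return infit, outfit

# Simulate 200 persons on 5 items, then reverse-score the last item so that it
# deliberately contradicts the latent trait and should show large misfit.
rng = np.random.default_rng(7)
theta = rng.normal(0.0, 1.0, 200)
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
X = (rng.random((200, 5)) < rasch_p(theta, b)).astype(float)
X[:, 4] = 1.0 - X[:, 4]                  # reverse-scored -> misfitting item

infit, outfit = item_fit(X, theta, b)
```

The four well-behaved items produce mean squares near 1, while the reverse-scored item's outfit is inflated, which is the pattern that motivated the item deletions above.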

Subscale Thresholds

For items on each of the six subscales in the current study, item thresholds and category probability curves indicated that all five categories were being utilized (Andrich, 1988; Linacre, 1999, 2002). Specifically, category frequencies were examined to determine the distribution of responses across each category. The average measure of ability at each category was also examined to ensure that the average measure increased with each category, indicating a higher ability at the higher levels of anxiety or agreement. Item thresholds for the six subscales are shown in Table 1. The frequencies and average measures of ability for each category on each of the six subscales are summarized in Tables 2 to 7. By viewing the observed count column, it can be seen that all five categories on the rating scale are being utilized well and there is no need to consider collapsing categories. The observed average column indicates that as anxiety or agreement increases, so does the average ability (Linacre, 1995). For example, at category 1 for the test anxiety subscale (Table 2), the observed average is 1.97, which can be interpreted as the average ability estimate, or logit score, for persons who chose category 1 ("No anxiety") on any item on the test anxiety subscale. These average measures are expected to increase in magnitude as statistics anxiety increases, which is true for this subscale and all other subscales of the STARS. The thresholds and category fit statistics should also be examined to verify how well the categories of the rating scale operated.

Figure 6. Item-person map for computational self-concept subscale.
The thresholds are the estimated difficulties for selecting one response category over another (e.g., the difficulty associated with endorsing category 1 over category 2; Wright and Masters, 1982). It has been shown that thresholds should increase by at least 1.4 logits, but by no more than 5 logits, between categories (Bond and Fox, 2007; Linacre, 1999). Upon examining the structure calibration column in Table 2, it can be seen that the rating scale here does not quite meet those criteria, suggesting the 5-point rating scale is not operating optimally. However, other criteria can be examined before concluding that there is a problem with the rating scale. Fit statistics can be examined as well to help determine the quality of the rating scale. Any infit or outfit statistics greater than 2 may be cause for concern (Linacre, 1999); however, that was not the case with any of the subscales in the current study.
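The two diagnostics described above, threshold spacing and monotonically increasing average measures, are simple enough to express directly. The following is a minimal sketch with hypothetical threshold and average-measure values (not the STARS calibrations, which are in the tables):

```python
def check_thresholds(taus, min_gap=1.4, max_gap=5.0):
    """Apply Linacre's guideline that adjacent category thresholds should
    advance by at least ~1.4 logits but by no more than ~5 logits."""
    gaps = [later - earlier for earlier, later in zip(taus, taus[1:])]
    return [(gap, min_gap <= gap <= max_gap) for gap in gaps]

def monotone_averages(avgs):
    """Observed average measures should rise with each successive category."""
    return all(later > earlier for earlier, later in zip(avgs, avgs[1:]))

# hypothetical structure calibrations and observed averages for a 5-category scale
taus = [-2.1, -0.6, 0.4, 1.9]
avgs = [-1.0, -0.3, 0.2, 0.9, 1.6]
```

With these hypothetical values the middle gap (1.0 logits) fails the 1.4-logit spacing criterion while the category averages still advance monotonically, mirroring the situation described above: a scale can miss the spacing guideline yet remain acceptable on the other diagnostics.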

Finally, category probability curves were used to assess the effectiveness of the 5-point rating scale. Given that all categories are being utilized well, fit statistics are within acceptable ranges, and the threshold estimates are acceptable, it appears that the quality of the STARS rating scales is good.

Table 1. Rating Scale Item Thresholds
Table 2. Rating Scale Category Diagnostics for Test and Class Anxiety Subscale
Table 3. Rating Scale Category Diagnostics for Interpretation Anxiety Subscale
Table 4. Rating Scale Category Diagnostics for Fear of Asking for Help Subscale
Table 5. Rating Scale Category Diagnostics for Worth of Statistics Subscale
Table 6. Rating Scale Category Diagnostics for Statistics Teacher Subscale
Table 7. Rating Scale Category Diagnostics for Computational Self-Concept Subscale

Differential Item Functioning

As part of the investigation of the quality of the STARS instrument, I compared the estimates across males and females and across undergraduates and graduates. Specifically, I performed a DIF analysis on each of the six subscales using WINSTEPS (Linacre, 2011). In the case of rating scale data, DIF indicates that one group more easily endorses an item after controlling for ability (Bond and Fox, 2007; Fischer and Molenaar, 1995). The criteria used to determine whether DIF existed were (1) a DIF contrast > .5 logits and (2) DIF values that were statistically significant, as indicated by t > 2 (Linacre, 2011). In a DIF analysis, the hypothesis is that a particular item has the same difficulty for two groups once ability is controlled. Figures 7 to 10 show the DIF graphs for undergraduate versus graduate students, as indicated in each subsection below. DIF did not exist between the sexes on any subscale items.

Test and class anxiety and fear of asking for help subscales. There was no evidence of DIF for these subscales.

Interpretation anxiety subscale. In Figure 7, it can be seen that graduates and undergraduates differed on items 5 (DIF = .51, t(365) = 3.66, p = .0003) and 7 (DIF = .88, t(360) = 6.09, p < .001) with ability held constant. Graduate students more easily endorsed item 7, while undergraduates more easily endorsed item 5. It makes sense that graduate students would more easily endorse item 7 ("Trying to decide which analysis is appropriate for my research project"), because they are more likely to conduct their own research for theses or dissertations, whereas undergraduates are not likely to engage in such research.
Item 5 relates to making objective decisions based on empirical data, an activity graduate students are likely to engage in with greater frequency. It is unclear why undergraduates would more easily endorse item 5.

Worth of statistics subscale. Graduate students and undergraduates displayed DIF (Figure 8) for item 40 ("I wish the statistics requirement would be removed from my academic program")

(DIF = .61, t(266) = 3.93, p = .0001), with undergraduates more easily endorsing the item. It seems intuitive that graduate students would see the value in statistics courses more than undergraduates would.

Fear of statistics teacher subscale. Item 43 ("Statistics teachers speak a different language") displays DIF (Figure 9) across undergraduates and graduate students (DIF = .54, t(351) = 4.42, p < .001), with graduate students more easily endorsing the item. It is possible that, because graduate students typically have to take more statistics courses than undergraduates, the language of statistics becomes more difficult as the level of the course increases.

Figure 7. Person DIF Plot, Interpretation Subscale.

Figure 8. Person DIF Plot, Worth Subscale.
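The flagging rule applied in these subsections, a contrast of more than .5 logits that is also statistically significant (t > 2), can be sketched as a short function. This is a minimal illustration of the two-part criterion with hypothetical group calibrations and standard errors, not the Winsteps computation itself (which also adjusts degrees of freedom as reported in the t statistics above):

```python
import math

def dif_flag(b_group1, se1, b_group2, se2, contrast_crit=0.5, t_crit=2.0):
    """Flag DIF only when the difficulty contrast exceeds .5 logits AND the
    contrast is statistically significant (|t| > 2)."""
    contrast = b_group1 - b_group2
    t = contrast / math.sqrt(se1 ** 2 + se2 ** 2)
    return contrast, t, (abs(contrast) > contrast_crit and abs(t) > t_crit)

# hypothetical calibrations for one item in two groups
contrast, t, flagged = dif_flag(b_group1=0.30, se1=0.10, b_group2=-0.31, se2=0.10)
```

Both conditions must hold: a small contrast with tiny standard errors is significant but substantively trivial, and a large contrast with huge standard errors is substantively large but statistically unreliable; neither alone is flagged.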

Self-concept subscale. Items 25 (DIF = .91, t(309) = 7.39, p < .001) and 34 (DIF = .75, t(295) = 5.59, p < .001) display DIF (Figure 10) across graduate students and undergraduates. It makes sense for graduate students to more easily endorse item 25 ("I have not had math for a long time. I know I will have problems getting through statistics."), given that graduate students probably have not had a math course in a while. Item 34 relates to enjoying math and was more easily endorsed by undergraduate students.

Figure 9. Person DIF Plot, Statistics Teacher Subscale.

Figure 10. Person DIF Plot, Self-Concept Subscale.

Discussion

Results of the CFA in the current study suggest that a multidimensional model of statistical

anxiety, comprising test anxiety, interpretation anxiety, anxiety when seeking help, worth of statistics, fear of statistics teacher, and computational self-concept, appears to be supported by the data. This is congruent with prior results published on the STARS (e.g., Baloğlu, 2002; Hanna et al., 2008). Therefore, in adhering to the six dimensions of statistics anxiety discussed in the literature and confirmed in the current study, a separate Rasch analysis was performed on each of the six subscales of the STARS instrument. Unidimensionality was assessed for each of the six subscales. In addition, the quality of the 5-point Likert-type rating scale was assessed by examining the step calibrations, category fit statistics, and category probability curves. In considering all quality indicators simultaneously, it appears that all five categories were adequately used. In following the principles of Rasch analysis as implemented in WINSTEPS, 20 of the original 51 items were deleted from the STARS instrument. The items on each subscale appear to discriminate well across the continuum of ability levels, with the principal exception being a general absence of items in the lower ability range. In terms of assessing DIF for each item across sex and student classification subpopulations, certain items displayed DIF for student classification only. This is an important consideration when interpreting statistics anxiety levels for various subpopulations. When items function differently for different subpopulations, the items may carry different meanings that depend on group membership. Other possible explanations for DIF include contextual variables, such as the wording of the items or the format of the survey. The existence of DIF does not necessarily invalidate the measure, but the use of the measure may be limited in terms of generalizing across groups.
It is important for the researcher to keep in mind that mean comparisons between subpopulations, for instance, may not be valid when DIF is present. If mean comparisons between groups are desired, any items that exhibit DIF should first be removed before proceeding with the comparison (Myers et al., 2006). As an illustration, suppose an instructor conducted an independent samples t test of interpretation anxiety scores across undergraduate and graduate students, and the results indicated that undergraduate students exhibited more anxiety than graduate students. The instructor might conclude that some sort of intervention is needed for the undergraduate students. However, if DIF is present in one or more of the items, the instructor's intervention could be unwarranted. Because DIF is present, the undergraduate students might be interpreting one or more of the items differently than the graduate students, and any comparison of the scores across groups is meaningless. That is, the intervention is unsubstantiated by the data.

Conclusions and Implications

The current study has detailed the use of the Andrich rating scale model within the context of Rasch analysis to study and interpret students' responses to the Statistical Anxiety Rating Scale (STARS) and to determine whether the items in each of the six subscales were measuring the same construct. The results of the analysis indicated which items within each subscale did not fit the Rasch model. After deleting items from each of the six subscales, each subscale is now assumed to be a more valid measure of the underlying construct (with the exception of the seeking help subscale, as no items were deleted). Therefore, the reduced 31-item STARS instrument may be more useful. In particular, the administration time of the new instrument would be substantially reduced, as would the time required for data entry. The original STARS was used (as it appeared in 1985).
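The Andrich rating scale model that underlies all of these analyses can be stated compactly: the probability of responding in category x is proportional to the exponential of the cumulative sum of (theta - delta - tau_k) over the steps up to x, with the step thresholds tau shared across the items of a subscale. A minimal sketch with hypothetical parameter values (the thresholds here are illustrative, not the STARS calibrations):

```python
import math

def rsm_probs(theta, delta, taus):
    """Category probabilities for one item under the Andrich rating scale model.

    delta is the item location and taus are the step thresholds shared by all
    items on the subscale; category 0 contributes exp(0) = 1 to the numerator.
    """
    numerators = [1.0]
    cum = 0.0
    for tau in taus:
        cum += theta - delta - tau        # add one more step term per category
        numerators.append(math.exp(cum))
    z = sum(numerators)                   # normalizing constant over categories
    return [n / z for n in numerators]

# hypothetical person, item, and threshold values for a 5-category item
probs = rsm_probs(theta=0.5, delta=0.0, taus=[-1.5, -0.5, 0.5, 1.5])
```

Because the thresholds are shared within a subscale, a single set of category probability curves characterizes the whole rating scale, which is what makes the category diagnostics reported above interpretable at the subscale level.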
It is recommended that some of the misfitting items be reworded. There are several double-barreled items (e.g., "Statistics is not really bad. It is just too mathematical."), as well as an overabundance of negatively worded items (e.g., "I feel statistics is a waste"). By altering some of the wording, it is possible that some of the misfitting items would adequately discriminate among a broader range of abilities. In addition, new items could be written to attempt to differentiate


More information

Approaches for the Development and Validation of Criterion-referenced Standards in the Korean Health Literacy Scale for Diabetes Mellitus (KHLS-DM)

Approaches for the Development and Validation of Criterion-referenced Standards in the Korean Health Literacy Scale for Diabetes Mellitus (KHLS-DM) Approaches for the Development and Validation of Criterion-referenced Standards in the Korean Health Literacy Scale for Diabetes Mellitus (KHLS-DM) Kang Soo- Jin, RN, PhD, Assistant Professor Daegu University,

More information

Measurement Invariance (MI): a general overview

Measurement Invariance (MI): a general overview Measurement Invariance (MI): a general overview Eric Duku Offord Centre for Child Studies 21 January 2015 Plan Background What is Measurement Invariance Methodology to test MI Challenges with post-hoc

More information

Development, Standardization and Application of

Development, Standardization and Application of American Journal of Educational Research, 2018, Vol. 6, No. 3, 238-257 Available online at http://pubs.sciepub.com/education/6/3/11 Science and Education Publishing DOI:10.12691/education-6-3-11 Development,

More information

AN ABSTRACT OF THE THESIS OF

AN ABSTRACT OF THE THESIS OF AN ABSTRACT OF THE THESIS OF Isaac J. Washburn for the degree of Master of Science in Human Development and Family Studies presented on February 12, 2009. Title: Rasch Modeling in Family Studies: Modification

More information

Statistics Anxiety among Postgraduate Students

Statistics Anxiety among Postgraduate Students International Education Studies; Vol. 7, No. 13; 2014 ISSN 1913-9020 E-ISSN 1913-9039 Published by Canadian Center of Science and Education Statistics Anxiety among Postgraduate Students Denise Koh 1 &

More information

RUNNING HEAD: EVALUATING SCIENCE STUDENT ASSESSMENT. Evaluating and Restructuring Science Assessments: An Example Measuring Student s

RUNNING HEAD: EVALUATING SCIENCE STUDENT ASSESSMENT. Evaluating and Restructuring Science Assessments: An Example Measuring Student s RUNNING HEAD: EVALUATING SCIENCE STUDENT ASSESSMENT Evaluating and Restructuring Science Assessments: An Example Measuring Student s Conceptual Understanding of Heat Kelly D. Bradley, Jessica D. Cunningham

More information

Assessing the Validity and Reliability of Dichotomous Test Results Using Item Response Theory on a Group of First Year Engineering Students

Assessing the Validity and Reliability of Dichotomous Test Results Using Item Response Theory on a Group of First Year Engineering Students Dublin Institute of Technology ARROW@DIT Conference papers School of Civil and Structural Engineering 2015-07-13 Assessing the Validity and Reliability of Dichotomous Test Results Using Item Response Theory

More information

A Rasch and confirmatory factor analysis of the General Health Questionnaire (GHQ) - 12

A Rasch and confirmatory factor analysis of the General Health Questionnaire (GHQ) - 12 RESEARCH Open Access Research A Rasch and confirmatory factor analysis of the General Health Questionnaire (GHQ) - 12 Adam B Smith* 1, Lesley J Fallowfield 2, Dan P Stark 3, Galina Velikova 3 and Valerie

More information

A TEST OF A MULTI-FACETED, HIERARCHICAL MODEL OF SELF-CONCEPT. Russell F. Waugh. Edith Cowan University

A TEST OF A MULTI-FACETED, HIERARCHICAL MODEL OF SELF-CONCEPT. Russell F. Waugh. Edith Cowan University A TEST OF A MULTI-FACETED, HIERARCHICAL MODEL OF SELF-CONCEPT Russell F. Waugh Edith Cowan University Paper presented at the Australian Association for Research in Education Conference held in Melbourne,

More information

1. Evaluate the methodological quality of a study with the COSMIN checklist

1. Evaluate the methodological quality of a study with the COSMIN checklist Answers 1. Evaluate the methodological quality of a study with the COSMIN checklist We follow the four steps as presented in Table 9.2. Step 1: The following measurement properties are evaluated in the

More information

O ver the years, researchers have been concerned about the possibility that selfreport

O ver the years, researchers have been concerned about the possibility that selfreport A Psychometric Investigation of the Marlowe Crowne Social Desirability Scale Using Rasch Measurement Hyunsoo Seol The author used Rasch measurement to examine the reliability and validity of 382 Korean

More information

3 CONCEPTUAL FOUNDATIONS OF STATISTICS

3 CONCEPTUAL FOUNDATIONS OF STATISTICS 3 CONCEPTUAL FOUNDATIONS OF STATISTICS In this chapter, we examine the conceptual foundations of statistics. The goal is to give you an appreciation and conceptual understanding of some basic statistical

More information

RASCH ANALYSIS OF SOME MMPI-2 SCALES IN A SAMPLE OF UNIVERSITY FRESHMEN

RASCH ANALYSIS OF SOME MMPI-2 SCALES IN A SAMPLE OF UNIVERSITY FRESHMEN International Journal of Arts & Sciences, CD-ROM. ISSN: 1944-6934 :: 08(03):107 150 (2015) RASCH ANALYSIS OF SOME MMPI-2 SCALES IN A SAMPLE OF UNIVERSITY FRESHMEN Enrico Gori University of Udine, Italy

More information

Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz

Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz This study presents the steps Edgenuity uses to evaluate the reliability and validity of its quizzes, topic tests, and cumulative

More information

Personality Traits Effects on Job Satisfaction: The Role of Goal Commitment

Personality Traits Effects on Job Satisfaction: The Role of Goal Commitment Marshall University Marshall Digital Scholar Management Faculty Research Management, Marketing and MIS Fall 11-14-2009 Personality Traits Effects on Job Satisfaction: The Role of Goal Commitment Wai Kwan

More information

CHAPTER VI RESEARCH METHODOLOGY

CHAPTER VI RESEARCH METHODOLOGY CHAPTER VI RESEARCH METHODOLOGY 6.1 Research Design Research is an organized, systematic, data based, critical, objective, scientific inquiry or investigation into a specific problem, undertaken with the

More information

Confirmatory Factor Analysis of the BCSSE Scales

Confirmatory Factor Analysis of the BCSSE Scales Confirmatory Factor Analysis of the BCSSE Scales Justin Paulsen, ABD James Cole, PhD January 2019 Indiana University Center for Postsecondary Research 1900 East 10th Street, Suite 419 Bloomington, Indiana

More information

THE APPLICATION OF ORDINAL LOGISTIC HEIRARCHICAL LINEAR MODELING IN ITEM RESPONSE THEORY FOR THE PURPOSES OF DIFFERENTIAL ITEM FUNCTIONING DETECTION

THE APPLICATION OF ORDINAL LOGISTIC HEIRARCHICAL LINEAR MODELING IN ITEM RESPONSE THEORY FOR THE PURPOSES OF DIFFERENTIAL ITEM FUNCTIONING DETECTION THE APPLICATION OF ORDINAL LOGISTIC HEIRARCHICAL LINEAR MODELING IN ITEM RESPONSE THEORY FOR THE PURPOSES OF DIFFERENTIAL ITEM FUNCTIONING DETECTION Timothy Olsen HLM II Dr. Gagne ABSTRACT Recent advances

More information

A Comparison of Several Goodness-of-Fit Statistics

A Comparison of Several Goodness-of-Fit Statistics A Comparison of Several Goodness-of-Fit Statistics Robert L. McKinley The University of Toledo Craig N. Mills Educational Testing Service A study was conducted to evaluate four goodnessof-fit procedures

More information

Factors Influencing Undergraduate Students Motivation to Study Science

Factors Influencing Undergraduate Students Motivation to Study Science Factors Influencing Undergraduate Students Motivation to Study Science Ghali Hassan Faculty of Education, Queensland University of Technology, Australia Abstract The purpose of this exploratory study was

More information

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1 Nested Factor Analytic Model Comparison as a Means to Detect Aberrant Response Patterns John M. Clark III Pearson Author Note John M. Clark III,

More information

Improving Measurement of Ambiguity Tolerance (AT) Among Teacher Candidates. Kent Rittschof Department of Curriculum, Foundations, & Reading

Improving Measurement of Ambiguity Tolerance (AT) Among Teacher Candidates. Kent Rittschof Department of Curriculum, Foundations, & Reading Improving Measurement of Ambiguity Tolerance (AT) Among Teacher Candidates Kent Rittschof Department of Curriculum, Foundations, & Reading What is Ambiguity Tolerance (AT) and why should it be measured?

More information

Examining the efficacy of the Theory of Planned Behavior (TPB) to understand pre-service teachers intention to use technology*

Examining the efficacy of the Theory of Planned Behavior (TPB) to understand pre-service teachers intention to use technology* Examining the efficacy of the Theory of Planned Behavior (TPB) to understand pre-service teachers intention to use technology* Timothy Teo & Chwee Beng Lee Nanyang Technology University Singapore This

More information

On the Performance of Maximum Likelihood Versus Means and Variance Adjusted Weighted Least Squares Estimation in CFA

On the Performance of Maximum Likelihood Versus Means and Variance Adjusted Weighted Least Squares Estimation in CFA STRUCTURAL EQUATION MODELING, 13(2), 186 203 Copyright 2006, Lawrence Erlbaum Associates, Inc. On the Performance of Maximum Likelihood Versus Means and Variance Adjusted Weighted Least Squares Estimation

More information

Paul Irwing, Manchester Business School

Paul Irwing, Manchester Business School Paul Irwing, Manchester Business School Factor analysis has been the prime statistical technique for the development of structural theories in social science, such as the hierarchical factor model of human

More information

Global Perspective Inventory (GPI) Report

Global Perspective Inventory (GPI) Report Global Perspective Inventory (GPI) 2012-2013 Report Executive Summary display higher levels of global competence than freshmen in all of the GPI scales except for the interpersonal social responsibility

More information

Associate Prof. Dr Anne Yee. Dr Mahmoud Danaee

Associate Prof. Dr Anne Yee. Dr Mahmoud Danaee Associate Prof. Dr Anne Yee Dr Mahmoud Danaee 1 2 What does this resemble? Rorschach test At the end of the test, the tester says you need therapy or you can't work for this company 3 Psychological Testing

More information

Doing Quantitative Research 26E02900, 6 ECTS Lecture 6: Structural Equations Modeling. Olli-Pekka Kauppila Daria Kautto

Doing Quantitative Research 26E02900, 6 ECTS Lecture 6: Structural Equations Modeling. Olli-Pekka Kauppila Daria Kautto Doing Quantitative Research 26E02900, 6 ECTS Lecture 6: Structural Equations Modeling Olli-Pekka Kauppila Daria Kautto Session VI, September 20 2017 Learning objectives 1. Get familiar with the basic idea

More information

By Hui Bian Office for Faculty Excellence

By Hui Bian Office for Faculty Excellence By Hui Bian Office for Faculty Excellence 1 Email: bianh@ecu.edu Phone: 328-5428 Location: 1001 Joyner Library, room 1006 Office hours: 8:00am-5:00pm, Monday-Friday 2 Educational tests and regular surveys

More information

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Thakur Karkee Measurement Incorporated Dong-In Kim CTB/McGraw-Hill Kevin Fatica CTB/McGraw-Hill

More information

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison Empowered by Psychometrics The Fundamentals of Psychometrics Jim Wollack University of Wisconsin Madison Psycho-what? Psychometrics is the field of study concerned with the measurement of mental and psychological

More information

Oak Meadow Autonomy Survey

Oak Meadow Autonomy Survey Oak Meadow Autonomy Survey Patricia M. Meehan, Ph.D. August 7, 214 1 Contents Contents 3 List of Figures 3 List of Tables 3 1 Introduction 4 2 Data 4 3 Determining the Number of Factors 5 4 Proposed Model

More information

Everything DiSC 363 for Leaders. Research Report. by Inscape Publishing

Everything DiSC 363 for Leaders. Research Report. by Inscape Publishing Everything DiSC 363 for Leaders Research Report by Inscape Publishing Introduction Everything DiSC 363 for Leaders is a multi-rater assessment and profile that is designed to give participants feedback

More information

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES Correlational Research Correlational Designs Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are

More information

Validation of an Analytic Rating Scale for Writing: A Rasch Modeling Approach

Validation of an Analytic Rating Scale for Writing: A Rasch Modeling Approach Tabaran Institute of Higher Education ISSN 2251-7324 Iranian Journal of Language Testing Vol. 3, No. 1, March 2013 Received: Feb14, 2013 Accepted: March 7, 2013 Validation of an Analytic Rating Scale for

More information

Organizational readiness for implementing change: a psychometric assessment of a new measure

Organizational readiness for implementing change: a psychometric assessment of a new measure Shea et al. Implementation Science 2014, 9:7 Implementation Science RESEARCH Organizational readiness for implementing change: a psychometric assessment of a new measure Christopher M Shea 1,2*, Sara R

More information

Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories

Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories Kamla-Raj 010 Int J Edu Sci, (): 107-113 (010) Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories O.O. Adedoyin Department of Educational Foundations,

More information

The measurement of media literacy in eating disorder risk factor research: psychometric properties of six measures

The measurement of media literacy in eating disorder risk factor research: psychometric properties of six measures McLean et al. Journal of Eating Disorders (2016) 4:30 DOI 10.1186/s40337-016-0116-0 RESEARCH ARTICLE Open Access The measurement of media literacy in eating disorder risk factor research: psychometric

More information

Agents with Attitude: Exploring Coombs Unfolding Technique with Agent-Based Models

Agents with Attitude: Exploring Coombs Unfolding Technique with Agent-Based Models Int J Comput Math Learning (2009) 14:51 60 DOI 10.1007/s10758-008-9142-6 COMPUTER MATH SNAPHSHOTS - COLUMN EDITOR: URI WILENSKY* Agents with Attitude: Exploring Coombs Unfolding Technique with Agent-Based

More information

Examining Psychometric Properties of Malay Version Children Depression Invento-ry (CDI) and Prevalence of Depression among Secondary School Students

Examining Psychometric Properties of Malay Version Children Depression Invento-ry (CDI) and Prevalence of Depression among Secondary School Students Pertanika J. Soc. Sci. & Hum. 24 (4): 1349-1379 (2016) SOCIAL SCIENCES & HUMANITIES Journal homepage: http://www.pertanika.upm.edu.my/ Examining Psychometric Properties of Malay Version Children Depression

More information

MBA 605 Business Analytics Don Conant, PhD. GETTING TO THE STANDARD NORMAL DISTRIBUTION

MBA 605 Business Analytics Don Conant, PhD. GETTING TO THE STANDARD NORMAL DISTRIBUTION MBA 605 Business Analytics Don Conant, PhD. GETTING TO THE STANDARD NORMAL DISTRIBUTION Variables In the social sciences data are the observed and/or measured characteristics of individuals and groups

More information

Testing the Assumption of Sample Invariance of Item Difficulty Parameters in the Rasch Rating Scale Model

Testing the Assumption of Sample Invariance of Item Difficulty Parameters in the Rasch Rating Scale Model Brigham Young University BYU ScholarsArchive All Theses and Dissertations 2007-08-20 Testing the Assumption of Sample Invariance of Item Difficulty Parameters in the Rasch Rating Scale Model Joseph A.

More information

Reliability, validity, and all that jazz

Reliability, validity, and all that jazz Reliability, validity, and all that jazz Dylan Wiliam King s College London Introduction No measuring instrument is perfect. The most obvious problems relate to reliability. If we use a thermometer to

More information

Instrument equivalence across ethnic groups. Antonio Olmos (MHCD) Susan R. Hutchinson (UNC)

Instrument equivalence across ethnic groups. Antonio Olmos (MHCD) Susan R. Hutchinson (UNC) Instrument equivalence across ethnic groups Antonio Olmos (MHCD) Susan R. Hutchinson (UNC) Overview Instrument Equivalence Measurement Invariance Invariance in Reliability Scores Factorial Invariance Item

More information

CHAPTER 3 METHODOLOGY

CHAPTER 3 METHODOLOGY CHAPTER 3 METHODOLOGY The research will be conducted in several steps below: Figure 3.1 Research Frameworks 3.1 Literature Study Involve conceptual literature from book related to Integrated Marketing

More information

Writing Reaction Papers Using the QuALMRI Framework

Writing Reaction Papers Using the QuALMRI Framework Writing Reaction Papers Using the QuALMRI Framework Modified from Organizing Scientific Thinking Using the QuALMRI Framework Written by Kevin Ochsner and modified by others. Based on a scheme devised by

More information

Examining the Psychometric Properties of The McQuaig Occupational Test

Examining the Psychometric Properties of The McQuaig Occupational Test Examining the Psychometric Properties of The McQuaig Occupational Test Prepared for: The McQuaig Institute of Executive Development Ltd., Toronto, Canada Prepared by: Henryk Krajewski, Ph.D., Senior Consultant,

More information

Mapping the Continuum of Alcohol Problems in College Students: A Rasch Model Analysis

Mapping the Continuum of Alcohol Problems in College Students: A Rasch Model Analysis Psychology of Addictive Behaviors Copyright 2004 by the Educational Publishing Foundation 2004, Vol. 18, No. 4, 322 333 0893-164X/04/$12.00 DOI: 10.1037/0893-164X.18.4.322 Mapping the Continuum of Alcohol

More information

Development and Psychometric Properties of the Relational Mobility Scale for the Indonesian Population

Development and Psychometric Properties of the Relational Mobility Scale for the Indonesian Population Development and Psychometric Properties of the Relational Mobility Scale for the Indonesian Population Sukaesi Marianti Abstract This study aims to develop the Relational Mobility Scale for the Indonesian

More information

The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests

The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests Mary E. Lunz and Betty A. Bergstrom, American Society of Clinical Pathologists Benjamin D. Wright, University

More information

Measuring the External Factors Related to Young Alumni Giving to Higher Education. J. Travis McDearmon, University of Kentucky

Measuring the External Factors Related to Young Alumni Giving to Higher Education. J. Travis McDearmon, University of Kentucky Measuring the External Factors Related to Young Alumni Giving to Higher Education Kathryn Shirley Akers 1, University of Kentucky J. Travis McDearmon, University of Kentucky 1 1 Please use Kathryn Akers

More information

The outcome of cataract surgery measured with the Catquest-9SF

The outcome of cataract surgery measured with the Catquest-9SF The outcome of cataract surgery measured with the Catquest-9SF Mats Lundstrom, 1 Anders Behndig, 2 Maria Kugelberg, 3 Per Montan, 3 Ulf Stenevi 4 and Konrad Pesudovs 5 1 EyeNet Sweden, Blekinge Hospital,

More information

THE COURSE EXPERIENCE QUESTIONNAIRE: A RASCH MEASUREMENT MODEL ANALYSIS

THE COURSE EXPERIENCE QUESTIONNAIRE: A RASCH MEASUREMENT MODEL ANALYSIS THE COURSE EXPERIENCE QUESTIONNAIRE: A RASCH MEASUREMENT MODEL ANALYSIS Russell F. Waugh Edith Cowan University Key words: attitudes, graduates, university, measurement Running head: COURSE EXPERIENCE

More information

A Modification to the Behavioural Regulation in Exercise Questionnaire to Include an Assessment of Amotivation

A Modification to the Behavioural Regulation in Exercise Questionnaire to Include an Assessment of Amotivation JOURNAL OF SPORT & EXERCISE PSYCHOLOGY, 2004, 26, 191-196 2004 Human Kinetics Publishers, Inc. A Modification to the Behavioural Regulation in Exercise Questionnaire to Include an Assessment of Amotivation

More information

Statistics Anxiety Towards Learning New Statistical Software

Statistics Anxiety Towards Learning New Statistical Software Rochester Institute of Technology RIT Scholar Works Theses Thesis/Dissertation Collections 8-2018 Statistics Anxiety Towards Learning New Statistical Software Shahd Saad Alnofaie ssa9425@rit.edu Follow

More information