Comprehensive Statistical Analysis of a Mathematics Placement Test

Robert J. Hall, Department of Educational Psychology, Texas A&M University, USA (bobhall@tamu.edu)
Eunju Jung, Department of Educational Psychology, Texas A&M University, USA (doduli@tamu.edu)
Michael S. Pilant, Department of Mathematics, Texas A&M University, USA (mpilant@tamu.edu)

Abstract: As part of an NSF Science, Technology, Engineering and Mathematics Talent Expansion Project (STEP) grant, a math placement exam (MPE) has been developed at Texas A&M University for the purpose of evaluating the pre-calculus mathematical skills of entering students. It consists of 33 multiple-choice questions and is taken online. A score of 22 or higher is required in order to register for Calculus I. Following admission to the University and prior to registering for classes, approximately 4,500 students take the placement exam each fall. To date, more than 15,000 students have taken the MPE. This paper focuses on the psychometric properties of the MPE based on analysis of test score results over a four-year period from 2008 to 2011. An item response theory (IRT) analysis has been performed. Various other statistical tests, such as a confirmatory factor analysis and the computation of Cronbach's alpha (α), have also been performed. The value for Cronbach's α (using the entire test sample) exceeds 0.90. This is comparable to high-stakes placement tests such as the SAT and the Advanced Placement tests. A detailed description of the statistical analyses performed on the MPE, as well as the results and their interpretation, will be presented in this paper. Finally, a brief analysis of the degree to which student performance in Calculus I can be predicted using MPE scores is presented.

Background

Texas A&M University has the second largest engineering program in the United States, with over 8,000 undergraduate engineering majors. Traditionally, students in this program take calculus during their freshman year, along with physics and other science, technology, engineering, and mathematics (STEM) courses. Some students are not sufficiently prepared and have difficulty passing their mathematics courses. In order to identify students with potential problems, a Math Placement Exam (MPE) was developed. Our goal was to develop a reliable, robust measure of the preparedness of incoming students for college-level calculus.

Beginning in 2008, almost all entering freshmen were required to take one of two math placement exams, either for engineering calculus or for business math. More than 4,000 students took the exams in summer 2011 prior to entering Texas A&M, and 1,781 students were enrolled in Calculus I during the fall 2011 semester. As of January 2012, data (MPE scores and grades) for more than 15,000 students in the science and engineering programs have been collected.

This study focuses on three research areas:
1. Internal consistency and robustness of the test instrument;
2. Number of latent variables present; and
3. Difficulty and discrimination of each test item.

Conceptual Design of the Mathematics Placement Exam (MPE)

The MPE consists of 33 multiple-choice items covering the following areas: polynomials, functions, graphing, exponentials and logarithms, and trigonometric functions. Questions were designed by two veteran faculty members who are experienced in teaching both pre-calculus and calculus. Each problem has 15 variants constructed from a template with different parameters; consequently, each of the variants is of equal difficulty. When an exam is created, a question is selected and one of its 15 variants is delivered online to the student. Questions are delivered in the same order in every exam, which ensures that each exam is of equal difficulty. After the student selects a response, another question is delivered until all questions are completed. Students have an opportunity to review questions and answers before submitting the test. Following submission, questions are graded (correct or incorrect), and scores are returned to students. There are 15^N different versions of the MPE, where N is the number of questions on the exam; consequently, students each receive an essentially unique set of questions on their version of the MPE.

Based on cumulative performance data, a cutoff score of 22 was established to help ensure basic algebra and pre-calculus skills in Calculus I. Historically, more than 70% of students with placement scores of 22 or greater pass Calculus I, where passing is defined as receiving a grade of A, B, or C. Under current guidelines, a student must score 22 or higher on the MPE in order to enroll in Calculus I. Students who score below 22 may take the exam again (after waiting a month) or enroll in a summer program called the Personalized Pre-Calculus Program (PPP). Until fall 2011, students with scores below 22 could still self-enroll in Calculus I; as of fall 2011, the registration system blocks registration in Calculus I if the MPE score is less than 22.

Statistical Analysis

Reliability

Under classical test theory, reliability is defined as the proportion of the true score variance over the observed score variance (Raykov & Marcoulides, 2011). Practically, we can define reliability as the consistency of the test scores across different administrations or different populations. Among the many indexes of reliability, Cronbach's α is the most widely used, since internal consistency among test items can be measured with only a single test administration. Cronbach's α is calculated by taking the mean of all possible split-half coefficients computed using Rulon's method (Crocker & Algina, 1986). Given the entire data set (over 15,000 tests), the Cronbach's α coefficient for the test items is 0.901, which indicates that the internal consistency among the items is very good.

One useful feature of the Cronbach's α analysis is the re-calculation of α when one question is removed. Removing the question that reduces α the least gives a reduced set of questions with the maximal α. This can be done repeatedly until one has a minimal number of questions with an α above a certain level. The results of this procedure are shown below in Figure 1.

Figure 1. Values of Cronbach's α Removing One Question at a Time from the MPE (n = 15,128)
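To make the reliability computation concrete, the following is a minimal Python (numpy) sketch of Cronbach's α and the drop-one-item scan summarized in Figure 1. The response matrix here is simulated purely for illustration, and the function names (cronbach_alpha, alpha_if_item_dropped) are ours, not part of the MPE analysis code.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (examinees x items) matrix of 0/1 scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # number of items
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

def alpha_if_item_dropped(scores):
    """Alpha recomputed with each item removed in turn (cf. Figure 1)."""
    k = scores.shape[1]
    return {j: cronbach_alpha(np.delete(scores, j, axis=1)) for j in range(k)}

# Hypothetical data: 200 simulated examinees on a 33-item dichotomous test.
rng = np.random.default_rng(0)
ability = rng.normal(size=(200, 1))
difficulty = rng.normal(size=(1, 33))
responses = (rng.random((200, 33)) < 1 / (1 + np.exp(-(ability - difficulty)))).astype(int)

print("alpha =", round(cronbach_alpha(responses), 3))
print("max alpha after dropping one item:",
      round(max(alpha_if_item_dropped(responses).values()), 3))
```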

In general, good internal consistency among items on a test is indicated when the Cronbach's α coefficient is greater than 0.8. The MPE has Cronbach's α values that exceed 0.8 both for the combined-year data (N = 15,128) and for the data separated by year (Table 1). Therefore, we can say that the MPE has very good internal consistency.

Year   Number of Students   Mean    Standard Deviation   Cronbach's α
2008   3792                 20.83   7.245                0.898
2009   3530                 21.66   7.004                0.895
2010   3735                 21.97   7.244                0.905
2011   4071                 22.34   7.081                0.902

Table 1. Values of Cronbach's α for the MPE by Year

Construct Validity

Under the traditional confirmatory factor analysis (CFA) model, only continuous observed variables can be handled. Often, however, questions are scored dichotomously, correct or incorrect. Advances in CFA modeling allow researchers to test factor structures using measures made up of dichotomous items. The CFA model for categorical data does not use the item scores themselves, but depends on the assumption of an underlying, normally distributed variable behind each discrete item or instrument component (Raykov & Marcoulides, 2011).

Confirmatory factor analysis has been widely used to test factor structures underlying tests or instruments with multiple items. For example, cognitive tests are usually designed to measure one or more theoretical latent constructs, such as math proficiency, reading comprehension, or problem-solving skills. The primary function of CFA models is to relate observed variables to unobserved latent constructs. CFA allows researchers to test hypotheses about a particular factor structure by examining both the dimensionality (i.e., the number of underlying factors) of the test and the pattern of relations between the items and factors (Brown, 2006).

To evaluate the construct validity of the MPE, various categorical data analysis models were tested using the software package Mplus 6 (Muthén & Muthén, 1998-2010). Confirmatory factor analysis of categorical variables uses the concept of latent variables: variables that are not directly observable but are assumed to give rise to the categorical responses. For the MPE, the observed variables are scores on the algebra test items, and the latent construct is thought to be higher-order algebra skills. Even though the observed scores in our data are dichotomous, the latent variables are thought to be continuous and assumed to be normally distributed. The estimation method used is weighted least squares.

Fundamentally, the CFA model for categorical data in Mplus 6 is equivalent to the two parameter item response theory (2-P IRT) model. For this analysis, we chose to fit a two parameter normal-ogive model rather than a two parameter logistic (2-PL) model. The normal-ogive model is mathematically more difficult to fit but provides a more robust, direct interpretation than does the logistic model. Mplus 6 provides indicators of overall model fit and parameter estimates (i.e., difficulty and discrimination) for the two parameter IRT model. Ultimately, however, we would like to be able to fit the more restrictive one parameter model, as we are primarily interested in whether unidimensionality (i.e., a single factor underlying the set of items) is plausible. We want to establish that the MPE can be characterized by a single latent higher-order algebra factor. We begin by fitting the less restrictive two parameter model and then proceed to examining fit under a one parameter model.
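As an illustration of the equivalence noted above between the categorical CFA model and the two parameter normal-ogive IRT model, the following minimal Python sketch converts a standardized factor loading λ and threshold τ into a normal-ogive discrimination a and difficulty b. It assumes the delta parameterization with a standard-normal latent trait; the numeric values are hypothetical, not MPE estimates.

```python
import math

def loading_threshold_to_irt(loading, threshold):
    """Convert a probit-CFA loading/threshold pair to normal-ogive IRT
    discrimination (a) and difficulty (b), assuming the delta
    parameterization with a standard-normal latent trait."""
    residual_sd = math.sqrt(1.0 - loading ** 2)   # sd of the item residual
    a = loading / residual_sd                     # discrimination
    b = threshold / loading                       # difficulty
    return a, b

# Hypothetical item: loading 0.70, threshold 0.35 (not actual MPE values).
a, b = loading_threshold_to_irt(0.70, 0.35)
print(f"a = {a:.3f}, b = {b:.3f}")   # a ≈ 0.980, b = 0.500
```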
The overall fit of the two parameter (single factor) model can be evaluated by computing a chi-square statistic and the accompanying fit indices, the root mean square error of approximation (RMSEA) and the comparative fit index (CFI). The items of the MPE were fitted to this model across years and by year; results are shown in Table 2. Although the chi-square values were statistically significant (indicating that our data did not fit the two parameter model exactly, either across years or within year), the MPE data samples were large, and large samples often result in significant chi-square values (Brown, 2006). In such cases, it is both necessary and appropriate to check other fit statistics, such as the CFI and RMSEA indices, before judging model fit.

Conceptually, the comparative fit index (CFI) measures the distance of the proposed model from a bad model, one that does not posit any relationship among the variables (i.e., covariances among all input indicators are fixed to zero). Hence, a larger CFI value, 0.95 or above, is deemed acceptable. For the MPE, the combined-year data had a CFI of 0.979; the individual-year data had CFIs of 0.977, 0.979, 0.982, and 0.980, respectively. The CFIs, therefore, offer a compelling argument for the two parameter model. The RMSEA measures the discrepancy between the model and the observed data; therefore, smaller RMSEA values are better. Typically, an RMSEA less than 0.05 is thought to represent a good fit between the proposed model and the observed data. For the MPE, the RMSEA value for the combined years was 0.026, while the RMSEA values for years 2008, 2009, 2010, and 2011 were 0.027, 0.025, 0.024, and 0.025, respectively. All RMSEAs were small, indicating very good model fit. To summarize, the CFA analyses were interpreted to indicate a good fit for a single factor (two parameter) model, whether the data were combined into one large group or separated into year cohorts. To further characterize this model, we turned to item response theory (IRT).

Year    N       χ²         df    CFI     RMSEA
2008    3792    1714.982   464   0.977   0.027
2009    3530    1482.488   464   0.979   0.025
2010    3735    1475.073   464   0.982   0.024
2011    4071    1640.535   464   0.980   0.025
Total   15128   5186.291   464   0.979   0.026

Table 2. Unidimensionality Confirmatory Factor Analysis Results (Two Parameter Model)

Item Response Theory (IRT) Analysis

Item response theory is called a latent trait model because it uses a mathematical model that relates a theorized latent construct (or trait) to the observed item responses through item parameters (Hambleton et al., 1991). In order to fit observed data to an IRT model, two assumptions should be met. First, the dimensionality of the items (i.e., the number of latent variables) should be confirmed. Second, the items should not be correlated after accounting for the latent factor (or factors); this assumption is called local independence. Based on the CFA analysis, the unidimensionality of the complete set of MPE items is supported. A large modification index, suggesting correlated unique factors under the CFA, would indicate a violation of the local independence assumption. In our analysis, the modification indices did not suggest adding any error covariances to reduce the overall χ², so there was no evidence of a violation of local independence. Therefore, the use of a unidimensional, single latent variable IRT model is justified.

Often, cognitive test items are scored as dichotomous (i.e., correct or incorrect). The IRT model allows for analysis of dichotomous data by relating the latent trait to the probability of a correct response for each item. Basically, the item characteristic curve of a dichotomous IRT model is an S-shaped curve: the horizontal axis represents the value of the latent trait, scaled to have a mean of zero, and the vertical axis represents the probability of making a correct response (see Figure 2).

Figure 2. Item Characteristic Curve: Dichotomous IRT Model
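Since the original figure is not reproduced here, the following minimal Python sketch (matplotlib) draws the kind of S-shaped item characteristic curve the text describes, marking the difficulty b as the trait value at which the probability of a correct response reaches 0.5. The parameter values are illustrative only, not estimates from the MPE.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative 2-PL item: discrimination a = 1.2, difficulty b = 0.5.
a, b = 1.2, 0.5
theta = np.linspace(-4, 4, 200)                    # latent trait (mean 0)
p_correct = 1 / (1 + np.exp(-a * (theta - b)))     # S-shaped response curve

plt.plot(theta, p_correct)
plt.axhline(0.5, linestyle="--")                   # P = 0.5 reference line
plt.axvline(b, linestyle="--")                     # difficulty b, where P = 0.5
plt.xlabel("Latent trait (θ)")
plt.ylabel("Probability of correct response")
plt.title("Item characteristic curve (illustrative 2-PL item)")
plt.show()
```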

The IRT model places item difficulty and individual ability on the same continuum; for this reason, we can compare item difficulty and an individual's ability level directly. There are three generally accepted dichotomous IRT models.

The first IRT model (the one parameter model) postulates a single item parameter, item difficulty. This corresponds to equation (1) with c_i = 0 and the a_i equal for every question, and it is closely related to the classical Rasch model (Rasch, 1961). In terms of a person's ability, the difficulty parameter implies that one needs more ability to endorse or pass a more difficult item. In Figure 2, the x-value corresponding to a 0.5 probability of passing an item indicates the difficulty parameter for that item.

The second IRT model involves two parameters and is referred to as the two parameter (2-PL) model. In this model, an attempt is made to fit two parameters to the observed data: item discrimination and item difficulty. The discrimination parameter represents how well the item differentiates individuals according to their ability level; it is reflected in the slope of the item response curve at the item's difficulty level. This corresponds to equation (1) with c_i = 0 for each question.

The third IRT model is a three parameter (3-PL) model that allows the parameter c_i to vary in addition to the difficulty and discrimination parameters. It is called a pseudo-guessing parameter and represents the probability that an examinee with very low ability passes the item. The three parameter IRT model can be written as

P_i(θ) = c_i + (1 − c_i) · exp[a_i(θ − b_i)] / (1 + exp[a_i(θ − b_i)])     (1)

In equation (1), b_i is the difficulty parameter of the i-th item, a_i is the discrimination parameter of the i-th item, and c_i is the pseudo-guessing parameter of the i-th item. As the skill (latent trait) increases, the probability of a correct response goes to 1; as the skill decreases, the probability asymptotically approaches that of guessing. The effect of guessing is thus reduced as the value of the latent trait increases.

In the case of the MPE, we chose not to fit a three parameter model. Conceptually, the pseudo-guessing parameter is problematic because it does not take into account differential option attractiveness; thus the random-guessing assumption of the model is not reflected in the response data. Given potential problems with the pseudo-guessing parameter, de Ayala (2009) suggests that a two parameter model "may provide a sufficiently reasonable representation of the data" (p. 126). To handle random guessing, we based our analyses on protocols with scores of 7 or higher: on the 33-item MPE, with each question having 5 possible responses, the expected score under pure random guessing is about 33/5 ≈ 6.6, so random guessing could account for scores of 6 or below.

Mplus 6 was used to fit both one and two parameter IRT models to the MPE data. Data were fitted to both models as a whole (i.e., combining years) and by year. Mplus 6 also allows us to fit the one and two parameter IRT models using either normal-ogive (normal cumulative distribution) or logistic functions; in this analysis, we used the normal-ogive function. First, the two parameter model, which is equivalent to the confirmatory factor analysis model tested previously, was found to be feasible with a single dimension. Next, a one parameter IRT model was tested by fixing all the factor loadings to be the same.
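As a concrete rendering of equation (1), here is a minimal Python sketch of the three parameter item response function. Setting c_i = 0 recovers the two parameter model, and additionally holding a_i constant across items corresponds to the one parameter case. The last line shows the expected chance score on a 33-item, five-option test, the rationale for restricting analyses to scores of 7 or higher; the parameter values are illustrative, not MPE estimates.

```python
import math

def p_correct(theta, a, b, c=0.0):
    """Three parameter logistic item response function, equation (1):
    c + (1 - c) * exp(a*(theta - b)) / (1 + exp(a*(theta - b)))."""
    z = math.exp(a * (theta - b))
    return c + (1.0 - c) * z / (1.0 + z)

# Illustrative item: a = 1.2, b = 0.5, c = 0.2 (five-option guessing floor).
for theta in (-3.0, 0.0, 0.5, 3.0):
    print(theta, round(p_correct(theta, a=1.2, b=0.5, c=0.2), 3))
# At theta = b the probability is halfway between c and 1; as theta falls,
# it approaches the guessing floor c; setting c = 0 gives the 2-PL model.

print("expected chance score on 33 five-option items:", 33 * (1 / 5))  # 6.6
```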
Table 3 presents the results of the one parameter IRT analyses of the data combined across years and separated by year. The combined data (N = 15,128) show a good fit, with indices of 0.954 for the CFI and 0.037 for the RMSEA. Compared with the two parameter model, however, the by-year fit is not as good: all the RMSEAs indicate good fit, while the CFIs indicate only adequate fit. Nevertheless, the one parameter model showed acceptable fit statistics for the combined-year as well as the by-year MPE data.

Year    N       χ²          df    CFI     RMSEA
2008    3792    4919.470    495   0.918   0.049
2009    3530    4533.435    495   0.915   0.048
2010    3735    4678.547    495   0.926   0.048
2011    4071    5008.461    495   0.921   0.047
Total   15128   10611.034   495   0.954   0.040

Table 3. Confirmatory Factor Analysis Results (One Parameter IRT Model)
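For readers who want to connect the reported indices to the test statistics, the RMSEA values can be approximated from χ², the degrees of freedom, and the sample size with the standard point-estimate formula RMSEA = sqrt(max(χ²/df − 1, 0) / (N − 1)). The sketch below applies it to the Table 2 figures and returns values matching those reported (≈ 0.026 for the combined years, ≈ 0.027 for 2008). Note that Mplus's weighted least squares estimator may compute these indices slightly differently, so this is an illustration, not a reproduction of the software output.

```python
import math

def rmsea(chi2, df, n):
    """Point estimate of the root mean square error of approximation."""
    return math.sqrt(max(chi2 / df - 1.0, 0.0) / (n - 1))

# Table 2 (two parameter model), combined years: χ² = 5186.291, df = 464, N = 15128.
print(round(rmsea(5186.291, 464, 15128), 3))   # ≈ 0.026

# Table 2, year 2008: χ² = 1714.982, df = 464, N = 3792.
print(round(rmsea(1714.982, 464, 3792), 3))    # ≈ 0.027
```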

In summary, further analyses of the difficulty and discrimination parameters will follow as needed, but at this point we are confident that the MPE instrument is psychometrically sound, measuring a single latent variable and sensitive to how well items differentiate individuals according to their ability level. This is fundamentally true whether we look at the group as a whole or break it into year-based cohorts.

Predicting Student Performance Using the MPE

Given that the outcome variable is categorical (i.e., A, B, C, D, F, or pass/fail), the Pearson correlation coefficient is not a good indicator of the relationship between grades and MPE scores. In the case of our data, for example, the resulting r² values ranged between 0.15 and 0.20. To better understand the relationship between MPE scores and grades, we switched to an odds-ratio (probability) analysis. Using historical data, we can compute the frequency of students passing the course with MPE scores in a given range. This cumulative frequency distribution can be modeled very accurately by a two parameter logistic function. Consequently, instead of using grades as the outcome variable, we compute the probability of passing for MPE scores in a given range. The r² value between the output of this model and the actual historical data exceeds 0.98. Using this model, we can identify an MPE score such that 70% of the students with this score or higher pass Calculus I; the MPE cutoff score of 22 originated from this analysis. (A sketch of this curve-fitting step appears after Table 4.)

MPE Scoring Across Years

Using a logistic curve-fitting model, it was determined that a cutoff score of 22 on the 33-item MPE instrument could be used to indicate who would be successful in the beginning calculus class, Calculus I. Psychometrically, the MPE instrument does a good job of discriminating performance on one latent variable (measuring "calculus readiness" or "pre-calculus ability") across item difficulty levels. The instrument was designed to measure the latent variable higher-order algebraic ability, and a cutoff score of 22 indicates a probability of about 0.7 that students majoring in engineering or the sciences will be successful in the gateway calculus class, Calculus I. The following analyses look at the relationship between student grades in Calculus I and measured ability on the Math Placement Exam.

Table 4 provides descriptive statistics for the MPE by year for 15,160 students. Average scores for the MPE are similar across years, but there is a noticeable upward (creeping) trend in the scores. An ANOVA with year as the independent variable and MPE score as the dependent variable produced a statistically significant main effect for year (F(3, 15156) = 32.44, p < .001). Subsequent post hoc testing (Tukey HSD and Bonferroni) indicated that the mean MPE score for 2008 was lower than the mean score for any other year. The mean MPE scores for 2009 and 2010 grouped together (no difference between means), as did the mean scores for 2010 and 2011. Overall, there is evidence that might be interpreted to suggest that MPE scores are increasing over time. What does this mean for using a firm cutoff point for the MPE? First, even though the cohort differences appear to be getting larger, the largest cohort mean difference is only about 1.5 questions. Although the difference is statistically significant, it may not be meaningfully significant.
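Before turning to effect size, the year comparison just described can be reproduced in outline with standard tools. The following Python sketch runs a one-way ANOVA and a Tukey HSD post hoc test on simulated score data, since the actual MPE records are not public; the cohort means merely echo Table 4, and all variable names are ours.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Simulated stand-in for the MPE data: one array of scores per cohort year,
# with means loosely echoing Table 4 (20.8, 21.7, 22.0, 22.4) and SD ≈ 7.
rng = np.random.default_rng(1)
years = {2008: 20.8, 2009: 21.7, 2010: 22.0, 2011: 22.4}
samples = {yr: rng.normal(mu, 7.0, size=3700) for yr, mu in years.items()}

# One-way ANOVA: is there a main effect of year on MPE score?
f_stat, p_value = stats.f_oneway(*samples.values())
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# Tukey HSD post hoc comparisons of cohort means.
scores = np.concatenate(list(samples.values()))
groups = np.concatenate([[yr] * len(s) for yr, s in samples.items()])
print(pairwise_tukeyhsd(scores, groups, alpha=0.05))
```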
Meaningful significance is addressed through the concept of explained or common variance, the effect size. Partial eta squared for the year main effect was 0.006, or 0.6 of one percent: knowing a student's MPE score tells us essentially nothing about the student's cohort year. An artifact of large samples is that small standard errors make small mean differences statistically significant.

Year    Mean      N       S.D.
2008    20.83     3792    7.24
2009    21.66     3530    7.00
2010    21.97     3735    7.24
2011    22.37     4103    7.07
Total   21.72 a   15160   7.16

a Note. At the time of this analysis, grades for 32 additional students were paired with MPE scores, increasing the n from 15,128 to 15,160.

Table 4. Descriptive Statistics for the MPE by Year
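As promised in the section on predicting student performance, here is a minimal Python sketch of the curve-fitting step behind the cutoff of 22: fit a two parameter logistic function to (MPE score, observed pass rate) pairs and invert it to find the score at which the predicted probability of passing Calculus I reaches 0.70. The pass-rate numbers below are made up for illustration; they are not the historical Texas A&M data.

```python
import numpy as np
from scipy.optimize import curve_fit

def pass_prob(score, k, x0):
    """Two parameter logistic model for P(pass Calculus I | MPE score)."""
    return 1.0 / (1.0 + np.exp(-k * (score - x0)))

# Hypothetical aggregated data: MPE score bins and observed pass rates.
scores = np.array([10, 13, 16, 19, 22, 25, 28, 31])
rates = np.array([0.25, 0.35, 0.47, 0.60, 0.71, 0.80, 0.88, 0.93])

(k, x0), _ = curve_fit(pass_prob, scores, rates, p0=[0.3, 20.0])

# Invert the fitted curve to find the score where P(pass) = 0.70.
target = 0.70
cutoff = x0 - np.log(1.0 / target - 1.0) / k
print(f"k = {k:.3f}, x0 = {x0:.2f}, score for 70% pass rate ≈ {cutoff:.1f}")
```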

Despite the fact that the mean increases are small and not of major concern, there is an upward trend. The creep in scoring, however, makes sense given the evolution of the instrument. Initially, in 2008, students were asked to complete the MPE before beginning their freshman year. At that time, the MPE was in a development phase; since nothing was at stake, students tended not to take the MPE more than once. As MPE scores have become tied to placement in freshman math classes, students are now more aware that a low score on the MPE will prevent them from taking Calculus I as an entering freshman. Since the highest score on repeated administrations of the MPE is the officially registered MPE value, we might expect some upward movement like that seen in the current data set. Students are also made aware, through academic counseling, that poor performance on the MPE reflects relative weakness in a skill area, complex algebra, that is related to poor performance in the four-course engineering math sequence. Students taking time to refresh and consolidate their algebraic skills may improve their performance on the MPE, thus contributing to the slight elevation in overall performance observed in the data set.

Summary and Conclusions

The purpose of IRT and Rasch modeling is to provide a framework for evaluating how well assessments work and how well individual items on assessments work (Embretson & Reise, 2000). Typically they are used in conjunction with test development, not to test whether assessments are psychometrically sound after development. We built the MPE using subject-matter experts, then used Rasch and IRT modeling to validate the MPE, thus lending support to the product development process. By performing a CFA of the MPE data, we found that the data could be modeled using a unidimensional, single latent variable model. From this, we were justified in applying an item response theory analysis, which supported a two parameter model; RMSEA and CFI values confirmed its fit. In addition, we examined the internal consistency of the 33-item MPE and found Cronbach's α values of approximately 0.90, both cumulatively and by year, indicating the high internal consistency of the placement test. Finally, if the relationship between grades in Calculus I and MPE scores is expressed as the probability of passing Calculus I given an MPE score range, the result is an accurate prediction of how students typically perform in Calculus I (i.e., the pass/retention rate).

In summary, results from these analyses, Cronbach's α, CFA, and IRT, attest to the psychometric soundness of the instrument for this sample of students. Moreover, the relationship between grades and MPE scores suggests that knowledge of performance on the MPE can be used to predict the probability that a student will experience difficulties in the first-year engineering calculus sequence. We recognize, however, that an instrument that works well for measuring the pre-calculus mathematical skills of students entering this particular university may not work well for measuring math placement readiness at other institutions, due to differing student populations. Nevertheless, the MPE developed at Texas A&M University for a particular engineering mathematics course provides some measure of confidence that placement exams with similar statistical properties can be developed for other courses at other institutions.
References

Brown, T. A. (2006). Confirmatory factor analysis for applied research. New York, NY: The Guilford Press.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Fort Worth, TX: Harcourt College.

de Ayala, R. J. (2009). The theory and practice of item response theory. New York, NY: The Guilford Press.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.

Muthén, L. K., & Muthén, B. O. (1998-2010). Mplus user's guide (6th ed.). Los Angeles, CA: Muthén & Muthén.

Rasch, G. (1961). On general laws and the meaning of measurement in psychology. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Vol. IV (pp. 321-334). Berkeley, CA: University of California Press.

Raykov, T., & Marcoulides, G. A. (2011). Introduction to psychometric theory. New York, NY: Routledge.

Acknowledgements

Research supported in part by a grant from the National Science Foundation, NSF-DUE #0856767. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the National Science Foundation.