Motivational Effects of Praise in Response Time-Based Feedback: A Follow-Up Study of the Effort Monitoring CBT 1


Xiaojing J. Kong, Steven L. Wise, J. Christine Harmes, Sheng-Ta Yang
James Madison University

A lack of examinee test-taking effort is a threat to test score validity. Previous approaches to eliciting strong effort from examinees (e.g., monetary incentives) have shown limited effectiveness. Recently, Wise, Bhola, and Yang (2006) developed Effort Monitoring Computer-Based Testing (EMCBT), in which item response times are used to measure examinee motivation and warnings are displayed to examinees exhibiting a low level of engagement. In this follow-up study, we further investigated the usefulness of the EMCBT in facilitating motivation and performance by adding a second experimental condition in which praise was given to examinees who consistently exhibited solution behavior. The results indicate that average test scores and examinee effort were significantly higher in the warning condition than in the control condition, whereas praise messages did not appear to significantly influence either performance or test-taking effort. These findings further our understanding of the motivational effects of the EMCBT and of its best use in maintaining examinee engagement throughout a test.

Students' performance on achievement tests can be affected by the level of effort they expend on the test. If a person does not give reasonable effort on a test, the resulting score is likely to underrepresent his or her true level of proficiency. A lack of test-taking effort is a particularly serious problem in low-stakes testing, for which assessment results, though important to educators, researchers, and policy makers, may have no personal consequences for the examinees. Many researchers have raised concerns about the validity of inferences and actions made on the basis of test scores obtained under low-stakes conditions (Wainer, 1993; Smith, 1999; Sundre & Kitsantas, 2004; Wise & DeMars, 2005). Combining results from 15 studies, Wise and DeMars (2005) found that the average difference between low-motivated examinees and their more motivated counterparts was over one half of a standard deviation. Given the importance of assessment results to test givers (such as schools, institutions, policy makers, and taxpayers), the magnitude of this effect size is practically meaningful and deserves greater attention.

Increasing student effort, and thus increasing performance, has been the focus of several research studies. One approach was to raise the stakes of the testing (Wolf & Smith, 1995; Wolf, Smith, & Birnbaum, 1995).

_____
1 Paper presented at the 2006 annual meeting of the National Council on Measurement in Education, San Francisco. Correspondence regarding this paper should be addressed to Xiaojing Jadie Kong, Center for Assessment and Research Studies, James Madison University, MSC 6806, Harrisonburg, VA 22807. Email: kongxx@jmu.edu

A general finding of these studies was that both student test performance and self-reported levels of effort were higher under consequential (i.e., high-stakes) conditions than under nonconsequential (i.e., low-stakes) conditions. However, the motivational benefits of raising test stakes are debatable, owing to undesirable consequences associated with high-stakes testing. In a high-stakes environment, for instance, the level of anxiety may increase, which could weaken the validity of the test scores; it is well documented that highly anxious examinees tend to score lower on tests (Hembree, 1988; Wolf & Smith, 1995). Furthermore, raising test stakes is not always practically feasible, because of students' negative attitudes toward high-stakes tests, limited public acceptance, and the resources needed to develop and administer high-stakes test items.

Another notable approach was the use of financial rewards. O'Neil, Sugrue, and Baker (1996) manipulated various incentives (e.g., money, tasks, and instructions) and found that only the money incentive increased performance and reported effort among 8th graders. However, the offer of money did not affect either the performance or the reported effort of 12th graders. More recently, O'Neil, Abedi, Miyoshi, and Mastergeorge (2005) extended this work, offering a larger amount of money ($10 per correct item) to 12th graders who were administered math items from the Third International Mathematics and Science Study (TIMSS). They found that the incentive group did show a higher level of test-taking effort but did not perform better than the non-incentive group. In contrast, a study by the German researchers Baumert and Demmrich (2001) indicated that financial rewards improved neither performance nor reported effort among 9th graders.

Beyond financial rewards, other approaches have been tried to coax examinees to work harder on low-stakes tests, such as changing test modalities (S. L. Wise et al., 2005) and promising test score feedback to examinees (V. L. Wise, 2004). None of these studies found significant differences among the experimental groups, in either test performance or test-taking motivation, indicating that more effective approaches are needed.

Recently, Wise, Bhola, and Yang (2006) developed Effort Monitoring Computer-Based Testing (EMCBT), in which warnings were displayed to examinees who exhibited a low level of engagement during a computerized test. In their experimental study, examinees were randomly assigned to either a warning condition or a no-warning condition, and the two groups were compared on test scores and examinee effort. The study yielded promising results: both test-taking effort and performance improved for examinees under the warning condition. The authors proposed that the validity of the test scores was enhanced because (a) performance was higher for the warning group, which was attributable to a higher level of effort; (b) variance in observed test scores was lower for the warning group, due to a reduction in construct-irrelevant variance (CIV); and (c) correlations of test performance with an external criterion (i.e., SAT scores) were stronger for the warning group, providing evidence of stronger convergent validity.

A point worth noting is that most of the studies described above used examinee self-reports to assess test-taking motivation and effort, whereas Wise et al. (2006) employed examinees' response times as indicators of test-taking effort in a low-stakes context. Compared to self-reports, response time information has important advantages. First, response time represents a direct observation of examinee behavior; it does not rely on examinee judgments, which may be biased by motivational processes. Moreover, the data can be collected in an unobtrusive and nonreactive way if examinees are unaware that response times are being recorded. In addition, the availability of data at the item level permits investigation of dynamic changes in the level of effort during a testing session.

In Wise et al. (2006), response time effort (RTE) was used to quantify examinee effort. RTE, developed as a new measure of examinee motivation, refers to the overall effort that an examinee expends across a series of items (for details, see S. L. Wise & Kong, 2005). RTE has been applied in a number of studies (Kong, Bhola, & Wise, in press; S. L. Wise, in press; S. L. Wise & DeMars, 2006).

In the present study, we further investigated the usefulness of the EMCBT in facilitating motivation and performance by adding a second experimental condition in which positive feedback was given to examinees who put forth good effort, as indicated by their response times. The strategy of actively seeking a correct answer to an item is referred to as solution behavior (Schnipke, 1995; Schnipke & Scrams, 1997). Examinees exhibiting this strategy received praise messages intended to encourage them to continue their solution behavior. When test-taking motivation is low, on the other hand, a student is more likely to respond to items rapidly. This type of test-taking strategy is referred to as rapid-guessing behavior (Schnipke, 1995; Schnipke & Scrams, 1997) and can be identified by response times that are too short to allow the examinee to fully consider the item. Praise messages were therefore expected to reduce the occurrence of rapid-guessing behavior.

A review of the research literature suggests that there have been no studies of the impact of positive feedback on test-taking behavior. Related research (e.g., Covington, 1998; Hancock, 2000, 2002) concerns the effects of verbal praise and other forms of external reinforcement on students' learning motivation. The influence of praise has been broadly addressed in many currently dominant theories of motivation, such as goal orientation theory, attribution theory, and self-efficacy theory (Silverman & Casazza, 2000). The lack of research on adult learners and mixed findings regarding students' reactions to praise prompted the current investigation. The objective of this study was to explore how praise messages could reinforce desirable test-taking behavior in a computer-based test (CBT). The primary research questions were: (1) Does Effort Monitoring (EM) praise feedback affect students' test-taking effort and their test performance? (2) Can the effects of the EM warning found in Wise et al. (2006) be replicated in the present study?

Method

Participants

The examinees in this study were 718 sophomores at a medium-sized southeastern university who took computer-based assessment tests during the university's spring Assessment Day in February 2006. Sophomores were required to participate in the university's Assessment Day. They were assigned to testing groups on the basis of the last two digits of their student identification numbers, and these groups were administered different combinations of tests assessing the university's general education program and student development. Some students received paper-and-pencil tests, while others received computer-based tests. Students who had missed Assessment Day for legitimate reasons were required to take makeup tests, which were essentially the same as the computer-based tests administered on Assessment Day. In our study, 575 (80%) of the examinees attended the makeup testing sessions. There were no individual consequences associated with the test results; these assessments could therefore be considered low-stakes tests. In addition to the assessment test data, Scholastic Assessment Test (SAT) scores were obtained from the student records database as a measure of examinees' level of academic ability. Subjects with missing SAT scores were deleted from the analyses, resulting in a final sample size of 682, with 382 (56%) females and 300 (44%) males.

Measures

Achievement Test. To be consistent with Wise et al.'s (2006) study, the Understanding Our World (UOW) test was administered to the examinees. This is a locally developed computer-based test (CBT) that assesses students' abilities to use mathematics and science to understand the natural world and society. The test consisted of 42 multiple-choice items with 2 to 5 response options per item.

Response Time Effort. As described previously, the RTE index is used to measure examinees' level of test-taking effort. More specifically, the amount of effort that examinee j put forth on the test is given by

RTE_j = \frac{\sum_{i=1}^{k} SB_{ij}}{k},   (1)

where k is the number of items and SB_{ij} indicates whether solution behavior or rapid-guessing behavior was exhibited on item i. When an examinee's response time on item i exceeds the predefined item threshold, SB_{ij} takes a value of 1, indicating solution behavior; otherwise, SB_{ij} takes a value of 0, indicating rapid-guessing behavior. RTE scores can range from zero to one and represent the proportion of test items on which the examinee exhibited solution behavior. RTE values near one indicate strong examinee effort, and the farther a value falls below one, the less effort was expended by the examinee (for more information, see S. L. Wise & Kong, 2005).
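To make Equation 1 concrete, the following is a minimal sketch of the RTE computation in Python. It is our illustration only; the function name, variable names, and example data are hypothetical and are not part of the EMCBT software.

# Minimal sketch of the RTE computation in Equation 1 (illustrative only).
def rte(response_times, thresholds):
    """Response time effort for one examinee.

    response_times: list of item response times in seconds.
    thresholds: list of item time thresholds (same length); a response
        at or below its threshold is treated as a rapid guess.
    Returns the proportion of items with solution behavior (0 to 1).
    """
    if len(response_times) != len(thresholds):
        raise ValueError("one response time and one threshold per item")
    sb = [1 if rt > th else 0 for rt, th in zip(response_times, thresholds)]
    return sum(sb) / len(sb)

# Example: a 5-item test with a fixed 3-second threshold (the threshold
# used here for the praise condition). Times of 2.0s and 1.1s count as
# rapid guesses, so RTE = 3/5.
print(rte([12.4, 2.0, 8.7, 1.1, 15.3], [3.0] * 5))  # -> 0.6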

Ability Measure. As mentioned above, Scholastic Assessment Test (SAT) scores, comprising Verbal and Quantitative scores, were used to assess examinees' level of academic ability.

Procedures

All of the assessments were administered in a university lab containing 102 computers. At the beginning of the testing session, trained proctors welcomed the students, informed them of lab regulations, and announced that students' attendance would be counted only if they completed all of the assessments, which included both attitudinal instruments (e.g., Attitudes Towards Learning) and achievement tests (e.g., Fine Arts). Students had a maximum of three hours to complete all of the assessments and could work at their own pace within that time. After providing an overview, the proctors distributed hard-copy test instructions and allowed students to start. The instructions covered the order of the tests, the test icons on the computer, and passwords.

On Assessment Day, the UOW was the 6th test (out of a total of 7 assessments) that students had to take. In the makeup testing sessions, a survey with a 5-minute time limit was added to the assessment list; this survey was the first instrument students were instructed to complete, so the UOW became the 7th test out of a total of 8. Based on response time data collected in previous years, 45 minutes was adequate for completing the UOW; in other words, administration of the test was considered unspeeded.

Following the procedures used in Wise et al. (2006), the administration software randomly assigned examinees to three conditions: (a) a praise condition, (b) a warning condition, and (c) a no-message (control) condition. For the warning condition, we used the same item time thresholds for the UOW test as those specified in Wise et al.'s (2006) study. For the praise condition, based on inspections of the item response time distributions obtained in the previous study, we used a fixed 3-second time threshold for each item, which is more conservative in identifying rapid-guessing behavior. This allowed us to more effectively identify examinees who gave at least some level of effort on the test. Moreover, previous research suggests that response time effort is not very sensitive to the particular threshold identification method used (Kong et al., in press). To be consistent with the prior study, we calculated RTE scores using the thresholds identified by Wise et al. (2006).

We also adopted Wise et al.'s (2006) criteria for displaying warning messages. Specifically, if an examinee exhibited three consecutive rapid guesses, the first warning message was displayed; if another set of three consecutive rapid guesses subsequently occurred, the second warning message was displayed. The contents of the warnings are shown in Table 1.

Table 1. Warning Messages Displayed in the Effort Monitoring CBT

First warning: "Your responses to this test indicate that you are not giving your best effort. It is very important that you try to do your best on the tests you take on Assessment Day. These assessment data are used by the university to better understand what students learn at <the university>, and what improvements need to be made. In addition, <the university's> assessment data are reported to the state as evidence of what <the university's> students know and can do."

Second warning: "<Examinee Name>, your responses continue to indicate that you are not trying to do your best on this test. These tests are very important, and you need to give them serious attention. Students who do not take Assessment Day activities seriously may be required to attend an Assessment Day make-up session."

Note. From "Taking the time to improve the validity of low-stakes tests: The effort-monitoring CBT," by S. L. Wise, D. S. Bhola, and S. Yang, 2006, paper to be presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA. Adapted with permission of the author.

For the praise condition, we used two praise messages and the following criteria for displaying them: (a) if an examinee exhibited six consecutive instances of solution behavior, the first praise message was displayed; (b) after the first message, the immediately following six items were not counted; and (c) the count restarted at the 7th item after the message, and if the examinee again exhibited six consecutive instances of solution behavior, the second praise message was displayed. Under these criteria, the earliest item position at which the first praise message could appear was the 6th item of the test, and the earliest position for the second praise message was the 18th item. The contents of the praise messages are shown in Table 2.

Each message popped up in the middle of the computer screen when the criteria for a praise or warning message had been met. Examinees could close the message box by clicking a Continue button. The time during which a message was displayed was not included in the item response times.
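To make these display rules concrete, the following is a minimal sketch of the trigger logic in Python. It is our illustrative reconstruction of the criteria described above, not the actual EMCBT administration code; the function names and the two-message cap reflect the two messages described in the text.

# Hypothetical reconstruction of the message-trigger rules (illustrative only).
def warning_events(is_rapid_guess):
    """1-based item positions at which the two warnings would fire:
    three consecutive rapid guesses trigger a warning; the run count
    resets after each warning. At most two warnings are shown."""
    events, run, shown = [], 0, 0
    for pos, rapid in enumerate(is_rapid_guess, start=1):
        run = run + 1 if rapid else 0
        if run == 3 and shown < 2:
            events.append(pos)
            shown += 1
            run = 0
    return events

def praise_events(is_solution):
    """1-based item positions at which the two praise messages would fire:
    six consecutive solution behaviors trigger praise; after the first
    message the next six items are not counted, then counting restarts."""
    events, run, skip, shown = [], 0, 0, 0
    for pos, sol in enumerate(is_solution, start=1):
        if skip:            # items ignored after the first message
            skip -= 1
            continue
        run = run + 1 if sol else 0
        if run == 6 and shown < 2:
            events.append(pos)
            shown += 1
            run = 0
            if shown == 1:
                skip = 6    # do not count the immediately following six items
    return events

# Earliest possible positions, matching the text: item 6 for the first
# praise message and item 18 for the second (6 triggers + 6 skipped + 6 more).
print(praise_events([True] * 20))   # -> [6, 18]
print(warning_events([True] * 8))   # -> [3, 6]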

Table 2. The Praise Feedback Displayed in the Effort Monitoring CBT

First message: "Your responses to this test indicate that you are trying hard. Thank you for your efforts. This is very important to <the university>. Keep up the good work!"

Second message: "<Examinee's First Name>, your responses continue to indicate that you are trying hard. Thank you for staying engaged in the test. This is very helpful to <the university>. Again, thanks!"

As the tables show, the two praise messages were similar in tone, whereas the two warning messages were quite different, with the second warning being more ominous. This is because the praise messages were designed to encourage students who consistently exhibited solution behavior to stay engaged; the goal of the EM praise intervention was to reinforce desirable test-taking behavior. In contrast, the goal of the EM warning intervention was to change undesirable test-taking behavior.

Results

The average test performance of the praise group as a whole was higher than that of the control group but lower than that of the warning group. Average test-taking effort was about the same for the praise and control groups, and .19 points lower than that of the warning group (see Table 3).

Table 3. Descriptive Statistics for Students' Test Performance and Test-Taking Effort under Each Condition

                        Praise (n = 213)     Warning (n = 211)    Control (n = 258)
Variable                M      SD     α      M      SD     α      M      SD     α
Test Performance        19.46  7.87   .87    21.46  6.38   .80    19.17  7.74   .87
Test-Taking Effort      .52    .41    .99    .71    .29    .97    .52    .38    .99

With regard to internal consistency, all three groups showed acceptable levels of test score reliability (Cronbach's α coefficients were .87, .80, and .87 for the praise, warning, and control conditions, respectively).
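For reference, Cronbach's α, the internal-consistency coefficient reported in Table 3, can be computed from an examinee-by-item score matrix as in the following sketch. This is our illustration only; the data shown are made up.

# Minimal sketch of Cronbach's alpha (illustrative; data are made up).
def cronbach_alpha(scores):
    """scores: list of examinee rows, each a list of k item scores."""
    n, k = len(scores), len(scores[0])
    def var(xs):  # sample variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    item_vars = [var([row[i] for row in scores]) for i in range(k)]
    total_var = var([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Tiny made-up example: 4 examinees, 3 dichotomously scored items.
print(round(cronbach_alpha([[1, 1, 1], [1, 0, 1], [0, 0, 1], [0, 0, 0]]), 2))  # -> 0.68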

The RTE scores, as the measure of examinee effort, also showed adequate reliability (Cronbach's α coefficients for RTE scores were over .95 in all conditions).

We hypothesized that examinees receiving praise messages would put forth more effort and perform better on the test than those receiving no EM feedback. As shown in Table 4, examinees who received praise messages in the praise condition performed slightly better, and exhibited somewhat higher effort on the test, than examinees in the control condition who deserved (but did not receive) praise messages. However, the mean differences between the groups were not statistically significant, and the effect sizes were small, ranging from .09 to .41. Of the 158 examinees in the control condition who deserved the first praise message, 69% stayed engaged enough to deserve the second; in comparison, of the 127 examinees who received the first praise message, 72% received the second. This difference was not significant.

We also hypothesized that test performance and test-taking effort would be higher for examinees in the warning condition than for those in the control condition, as reported in Wise et al. (2006). Consistent with their findings, comparisons between the warning group and the control group revealed significant effects of the EM warnings: both test performance and test-taking effort were significantly higher for examinees who received warning messages. The effect sizes were as large as a 1.37 standard deviation difference in examinee effort and a .61 standard deviation difference in test performance (see Table 4).

Wise et al. (2006) also examined the variances of the test scores and the correlations of test scores with external variables (i.e., SAT Verbal and SAT Math scores). They found that, for examinees deserving the first warning, these relationships were stronger under the warning condition, even with substantially reduced variance in test scores. Those findings were replicated in the present study (see Table 5). The same was not exactly true for the praise condition: although the correlations between test performance and SAT scores were higher for the praise group, the variance in test scores was also higher, and none of the differences in correlations was statistically significant.
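For readers who wish to reproduce the group comparisons in Table 4, the t statistics and Cohen's d values can be computed from the group summary statistics, assuming the conventional pooled-variance formulas. This is a sketch, not the authors' analysis code.

# Sketch of the statistics in Table 4: a pooled-variance independent-samples
# t test and Cohen's d from group summary statistics (illustrative only).
import math

def pooled_t_and_d(m1, sd1, n1, m2, sd2, n2):
    sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)  # pooled variance
    t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    d = (m1 - m2) / math.sqrt(sp2)  # standardized mean difference
    return t, n1 + n2 - 2, d

# Test performance, examinees deserving the first warning (Table 4):
t, df, d = pooled_t_and_d(18.34, 5.21, 131, 15.13, 5.34, 172)
print(round(t, 3), df, round(d, 2))  # approximately 5.24, 301, 0.61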

Table 4. Effects of Experimental Conditions on Test Performance and Test-Taking Effort

                                         Treatment               Control
Variable                                 n     M      SD         n     M      SD        t       df    p      d

Examinees Deserving First Praise
  Test Performance                       127   24.11  6.74       158   23.49  6.68      .779    283   .436   .09
  Effort after Deserving First Praise    127   0.83   0.26       158   0.77   0.27      1.837   283   .067   .23

Examinees Deserving Second Praise
  Test Performance                       91    26.80  5.51       109   26.19  5.40      .787    198   .432   .11
  Effort after Second Praise             90    0.93   0.08       107   0.88   0.15      3.337   195   .001   .41
  Proportion Deserving Second Praise     127   0.72   0.45       158   0.69   0.46      .487    283   .626   .07

Examinees Deserving First Warning
  Test Performance                       131   18.34  5.21       172   15.13  5.34      5.236   301   <.001  .61
  Effort after Deserving First Warning   131   0.59   0.31       172   0.21   0.25      11.643  301   <.001  1.37

Examinees Deserving Second Warning
  Test Performance                       90    16.63  4.81       153   14.09  4.49      4.147   241   <.001  .55
  Effort after Second Warning            90    0.49   0.35       152   0.15   0.20      9.752   240   <.001  1.28
  Proportion Deserving Second Warning    131   0.69   0.46       172   0.89   0.31      4.513   301   <.001  .52

Note. The treatment group is the praise condition for the praise rows and the warning condition for the warning rows; the comparison group is the control condition in all rows.
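Table 5, which follows, reports tests of the difference between independent correlations. Its footnote describes a t test of that difference; the sketch below uses the Fisher r-to-z approximation, a standard approach that we show only to illustrate the comparison being made (it is an assumption, not necessarily the authors' exact procedure).

# Sketch of a significance test for the difference between two independent
# correlations, via the Fisher r-to-z transformation (illustrative only).
import math

def independent_r_diff(r1, n1, r2, n2):
    z1, z2 = math.atanh(r1), math.atanh(r2)          # Fisher r-to-z transform
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # two-tailed
    return z, p

# SAT Verbal, examinees deserving the first warning (Table 5):
print(independent_r_diff(0.31, 131, 0.08, 172))  # z about 2.05, p about .04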

Table 5. Correlations between Test Performance and SAT Scores

                                   Treatment              Control            Difference in
Variable                           N      r               N      r           shared variance 1   Significance 2

Examinees Deserving First Praise
  SAT Verbal                       158    .38             127    .25         .08                 .230
  SAT Math                         158    .43             127    .24         .13                 .074

Examinees Deserving First Warning
  SAT Verbal                       131    .31             172    .08         .09                 .040
  SAT Math                         131    .30             172    .15         .07                 .176

1 Treatment-group r squared minus control-group r squared (e.g., r²_Warning − r²_No Warning).
2 Probability associated with a t test of the difference between independent correlations.

Discussion

In this study, the EM praise feedback influenced test performance and examinee effort in the hypothesized direction, but the effects were not strong enough to make a meaningful difference. Moreover, although the proportion of examinees who stayed engaged enough to deserve the second praise message was slightly higher in the praise condition than in the control condition, the difference was not significant, and the effect size was rather small.

In terms of reliability, the praise group showed internal consistency as high as the control group's, for both test scores and examinee effort. Test score reliability for the warning group was also adequate but relatively lower, primarily due to the interruptions caused by warnings during the testing session. This finding of relatively lower test score reliability under the warning condition is consistent with Wise et al. (2006).

In addition to addressing the question of EM praise effects, this study provides further evidence supporting the effectiveness of the EM warning. Wise et al. (2006) reported that examinee effort was significantly higher in the warning condition; however, the effect of the EM warning on test performance was nonsignificant. In the current study, group differences in both performance and effort were significant, and the effect sizes were noticeably larger.

For example, the largest effect size reported by Wise et al. (2006) was .83 standard deviations, for the group difference in test-taking effort among examinees deserving the second warning; the counterpart effect size in our study was 1.28 standard deviations.

It is worth pointing out that average RTE scores, as a measure of examinee motivation, were considerably lower in this study. According to Wise et al. (2006), the average RTE scores across all examinees were .86 for the control group and .92 for the warning group; in this study, the corresponding values were .52 and .71. Thus, the effects of the EM warning were stronger when the proportion of low-motivation examinees was larger. The disturbingly low motivation in our study may be due to the number of tests that students had to complete before taking the UOW. In Wise et al. (2006), the UOW was the first assessment that students took; in the current study, it was the 6th or 7th test. Students may have been fatigued, and more reluctant to take another test, by the time they started the UOW. In addition, the UOW is a scientific and quantitative reasoning test that demands a relatively high level of cognitive processing. Given these motivation and effort issues, future assessment practice may need to reduce the number of tests, shorten the assessment session, or rearrange the order of instruments (from hard to easy, for instance).

Although not as encouraging as the findings regarding the EM warning, the evidence regarding the usefulness of EM praise should not be considered conclusive. The weak effects of EM praise obtained in the present study may be due to various factors, such as the content of the messages or the criteria used for receiving positive feedback. Further investigation is needed to explore the potential role of EM praise in facilitating motivation and performance in computer-based tests.

References

American College Testing. (1996). Student effort and performance on a measure of postsecondary educational development. Iowa City, IA: Author.

Arvey, R., Strickland, W., Drauden, G., & Martin, C. (1990). Motivational components of test taking. Personnel Psychology, 43, 695-716.

Baumert, J., & Demmrich, A. (2001). Test motivation in the assessment of student skills: The effects of incentives on motivation and performance. European Journal of Psychology of Education, 16, 441-462.

Covington, M. V. (1998). The will to learn: A guide for motivating young people. New York: Harper and Row.

DeMars, C. (1999, April). Does the relationship between motivation and performance differ with ability? Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal, Canada. (ERIC Document Reproduction Service No. ED430037)

Hancock, D. (2000). The impact of verbal praise on college students' time spent on homework. Journal of Educational Research, 37, 32-47.

Hancock, D. (2002). Influencing graduate students' classroom achievement, homework habits and motivation to learn with verbal praise. Educational Research, 44, 83-95.

Hembree, R. (1988). Correlates, causes, effects, and treatment of test anxiety. Review of Educational Research, 58, 47-77.

Kong, X. J., Bhola, D. S., & Wise, S. L. (in press). Setting the threshold parameter to differentiate solution behavior from rapid guessing behavior. Educational and Psychological Measurement.

O'Neil, H. F., Abedi, J., Miyoshi, J., & Mastergeorge, A. (2005). Monetary incentives for low-stakes tests. Educational Assessment, 10, 185-208.

O'Neil, H. F., Sugrue, B., & Baker, E. L. (1996). Effects of motivational interventions on the National Assessment of Educational Progress mathematics performance. Educational Assessment, 3, 135-157.

Schnipke, D. L. (1995, April). Assessing speededness in computer-based tests using item response times. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA.

Schnipke, D. L., & Scrams, D. J. (1997). Modeling item response times with a two-state mixture model: A new method of measuring speededness. Journal of Educational Measurement, 34, 213-232.

Silverman, S. L., & Casazza, M. E. (2000). Learning and development. San Francisco, CA: Jossey-Bass.

Smith, L. F. (1999). The role of consequence on national standardized tests. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal, Canada.

Snow, R. E., Corno, L., & Jackson, D. (1996). Individual differences in affective and conative functions. In D. C. Berliner & R. C. Calfee (Eds.), Handbook of educational psychology (pp. 243-310). New York: Macmillan.

Sundre, D., & Kitsantas, A. (2004). An exploration of the psychology of the examinee: Can examinee self-regulation and test-taking motivation predict consequential and non-consequential test performance? Contemporary Educational Psychology, 29, 6-16.

Wainer, H. (1993). Measurement problems. Journal of Educational Measurement, 30, 1-21.

Wigfield, A. (1994). Expectancy-value theory of achievement motivation: A developmental perspective. Educational Psychology Review, 6.

Wise, S. L. (in press). An investigation of the differential effort received by items on a low-stakes, computer-based test. Applied Measurement in Education.

Wise, S. L., Bhola, D., & Yang, S. (2006). Taking the time to improve the validity of low-stakes tests: The effort-monitoring CBT. Paper to be presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA.

Wise, S. L., & DeMars, C. E. (2005). Low examinee effort in low-stakes assessment: Problems and potential solutions. Educational Assessment, 10, 1-17.

Wise, S. L., & DeMars, C. E. (2006). An application of item response time: The effort-moderated IRT model. Journal of Educational Measurement, 43, 19-38.

Wise, S. L., & Kong, X. (2005). Response time effort: A new measure of examinee motivation in computer-based tests. Applied Measurement in Education, 18, 163-183.

Wise, S. L., Owens, K., Yang, S., Weiss, B., Kissel, H., Kong, X., & Horst, J. (2005). An investigation of the effects of self-adapted testing on examinee effort and performance in a low-stakes achievement test. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal, Canada.

Wise, V. L. (2004). The effects of the promise of test feedback on examinee performance and motivation under low-stakes testing conditions. Unpublished doctoral dissertation, University of Nebraska-Lincoln.

Wolf, L. F., & Smith, J. K. (1995). The consequence of consequence: Motivation, anxiety, and test performance. Applied Measurement in Education, 8, 227-242.

Wolf, L. F., Smith, J. K., & Birnbaum, M. E. (1995). Consequence of performance, test motivation, and mentally taxing items. Applied Measurement in Education, 8, 341-351.