Unproctored Internet-Based Tests of Cognitive Ability and Personality: Magnitude of Cheating and Response Distortion


Industrial and Organizational Psychology, 2 (2009), 39–45. Copyright © 2009 Society for Industrial and Organizational Psychology. 1754-9426/09

WINFRED ARTHUR, JR., RYAN M. GLAZE, AND ANTON J. VILLADO
Texas A&M University

JASON E. TAYLOR
People Answers, Inc.

Correspondence concerning this article should be addressed to Winfred Arthur, Jr. E-mail: w-arthur@tamu.edu. Address: Department of Psychology, Texas A&M University, 4235 TAMU, College Station, TX 77843-4235. Winfred Arthur, Department of Psychology, Texas A&M University; Ryan M. Glaze and Anton J. Villado, Department of Psychology, Texas A&M University; Jason E. Taylor, People Answers, Inc.

Malfeasant behaviors on tests may take one of two forms: cheating or response distortion. Cheating is associated with ability and knowledge tests, and response distortion is associated with noncognitive/personality measures. Pursuant to the cheating- and response distortion-related questions posed by Tippins (2009), the objective of the present article is to present a summary of the pertinent results of a study (i.e., Arthur, Glaze, Villado, & Taylor, 2009) that was designed to (a) investigate the efficacy of using a speeded test format to address concerns about cheating on an unproctored Internet-based cognitive ability test and (b) document the extent to which the magnitude of response distortion on two unproctored Internet testing (UIT) personality measures was similar to or different from that reported for proctored personality measures in the extant literature.

Cheating on Cognitive Ability Tests

It is not unreasonable to assume that cheating can and does occur in employment testing. Although we were unable to locate any published empirical studies on the prevalence of cheating in employment testing, cheating in educational settings appears to be quite widespread. For example, reviews by Cizek (1999) and Whitley (1998) indicate that approximately half of all college students reported cheating on an exam at least once during their college education. Furthermore, it seems reasonable that the prevalence of cheating is likely to be exacerbated with the use of UIT because proctoring is the primary means by which cheating is curtailed. Several techniques to prevent and detect cheating are noted by Tippins. As yet another potential technique, Arthur et al. (2009) investigated the efficacy of speeded tests that, by virtue of their time constraints, may serve to curtail the anticipated cheating behaviors. Of course, this is predicated on the assumption that a speeded administration is consonant with the job-relatedness of the test.

Assuming there is no preknowledge of test content, possible modes of cheating (e.g., using additional aids or helpers) are not independent of time. That is, if individuals have no preknowledge of the test content and are not using surrogate test takers, then time constraints should make it more difficult for them to engage in malfeasant behaviors under speeded conditions and should thereby deter such behaviors. Consequently, an objective of Arthur et al. was to investigate whether observed retest score changes on a speeded UIT cognitive ability test (120 items with a 20-minute administration time) could be better accounted for by a cheating or a psychometric retest explanation. In addition, Arthur et al. investigated whether the speeded UIT cognitive ability test differed from proctored cognitive ability tests (in the extant literature) in terms of retest score changes.

Response Distortion on Personality Tests

As previously noted, the primary malfeasant behavior of interest with personality and other noncognitive measures is response distortion. Although there is no clear consensus about their effectiveness and efficacy, several techniques have been proposed for preventing or minimizing response distortion on personality measures. These include the use of forced-choice responses, empirical keying, warnings of verification, and response elaboration (see Hough, 1998, for a review). However, absent from this list of techniques is the use of test proctors because the presence of test proctors has little or no bearing on controlling response distortion. Consequently, in contrast to cognitive ability (and knowledge) tests, concerns regarding response distortion on personality and other noncognitive tests are the same in both proctored and unproctored testing conditions. In summary, it is not unreasonable to expect that the magnitude of response distortion should be similar for both UIT and proctored personality tests. Hence, a second objective of Arthur et al. (2009) was to investigate whether unproctored and proctored personality tests (based on the extant literature) differed in terms of the magnitude of response distortion; specifically, do UIT personality tests display higher levels of response distortion than proctored tests?

Summary of Arthur et al.'s (2009) Results

Arthur et al. accomplished their study objectives by implementing two within-subjects design studies (Study 1, N = 296; Study 2, N = 318) in which test takers first completed UIT cognitive ability and personality measures as job applicants (high stakes) or incumbents (low stakes) and then completed them again as research participants (low stakes). Thus, applicants were considered to have experienced high-stakes testing and low-stakes retesting (i.e., HL-stakes), and incumbents were posited to have experienced low stakes during both testing and retesting (i.e., LL-stakes). Because the testing firm uses the tests for a wide range of positions for their clients, Arthur et al.'s participants represented a variety of jobs (17 Standard Occupational Classification [SOC] major groups were represented) in a number of organizations (14 North American Industry Classification System [NAICS] industry types were represented).
Across both studies, the three most common SOC groups were management occupations, business and finance occupations, and sales and related occupations; the three most common NAICS industry types were retail trade; finance and insurance; and professional, scientific, and technical services. In Study 1, participants completed both the cognitive ability and the personality measures. However, in Study 2, participants completed only a personality measure (which was different from that used in Study 1); in addition, there were no Time 1 incumbents in Study 2. For the proctored ability and personality test comparisons, Arthur et al. compared their results to the meta-analytic results reported by Hausknecht, Halpert, Di Paolo, and Moriarty-Gerrard (2007) and Birkeland, Manson, Kisamore, Brannick, and Smith (2006), respectively. The pertinent findings of these meta-analyses are briefly reviewed below.

The results of Hausknecht et al.'s (2007) meta-analysis of practice effects on proctored cognitive ability tests indicated retest improvements in scores both in the presence (d = 0.64, k = 23, N = 2,323) and in the absence (d = 0.21, k = 75, N = 81,374) of interventions such as test coaching and training. In addition, the mean improvement in test scores under operational (d = 0.27, k = 19, N = 61,795) and research-based testing conditions (d = 0.22, k = 88, N = 72,641) was quite similar. It should be noted that, unlike in Arthur et al. (2009), the testing conditions for the test and the retest in Hausknecht et al. were identical: For the operational settings, both conditions were high stakes; likewise, for the research conditions, both were low stakes. For the issues at hand, the operational results are most relevant.

Birkeland et al. (2006) investigated job applicant response distortion (faking) on personality measures by comparing applicants' and incumbents' scores on the five-factor model (FFM) personality dimensions. Birkeland et al. also drew a distinction between direct and indirect measures of the FFM dimensions, where direct measures (e.g., the NEO-FFI) were defined as those specifically designed to measure the FFM personality factors. In contrast, indirect measures (e.g., the 16PF) were not specifically designed to measure the FFM personality factors but could be, and were, sorted into the FFM personality dimensions. As a summary statement, Birkeland et al.'s results generally indicated higher mean scores for applicants. However, their results also indicated that the magnitude of the mean differences between applicants and incumbents was a function of the personality dimension and the test type (i.e., direct vs. indirect measure). Although Arthur et al. (2009) used indirect FFM measures, their results are compared to both Birkeland et al.'s indirect and direct measure results.

The results of Arthur et al.'s (2009) studies are summarized in Tables 1 and 2 and Figures 1 and 2. For Figures 1 and 2, Arthur et al. used the standard error of measurement (SEM) of the difference scores (i.e., 1 SEM_d) to identify individuals who may have engaged in malfeasance. A number of summary statements can be made on the basis of their results. First, the use of a speeded UIT cognitive ability test resulted in high-stakes/low-stakes retest effects that were more consonant with a psychometric practice-effect explanation than with a malfeasance explanation. Specifically, consistent with psychometric theory, the Time 2 scores were moderately higher than the Time 1 scores (d = 0.39). These findings are consistent with those reported by Hausknecht et al. (2007). Furthermore, using the 1 SEM_d band to operationalize malfeasance (i.e., differences between Time 2 and Time 1 scores that fell below the 1 SEM_d band [the Time 2 score was lower than the Time 1 score] were considered evidence of malfeasance) indicated a relatively small percentage of cheaters (7.77%).[1] In interpreting these data, it should be noted that lower Time 2 scores (which we infer to reflect "cheating") could be because of (a) the Time 1 score being elevated by cheating and the Time 2 score being the true score or (b) the Time 1 score being the true score and the Time 2 score being an unmotivated test performance score (i.e., the participant did not take the test seriously).
Given their design and data, Arthur et al. were unable to distinguish between these two explanations of the observed score difference. The pattern of results and the associated conclusions of Arthur et al. are consistent with those of Nye, Do, Drasgow, and Fine (2008), who used a perceptual speed test. Consequently, the percentage of cheaters in Arthur et al.'s sample conceivably ranges from 0% (all lower Time 2 scores are because of explanation b) to 7.77% (all lower Time 2 scores are because of explanation a). Hence, 7.77% might best be viewed as the upper limit for cheating in their sample. So, in their totality, their results did not support the presence of widespread score increases on the speeded ability test as a result of high-stakes testing. Consequently, it would seem that the use of speeded tests, assuming they are not at odds with the job requirements, might be one means of alleviating cheating concerns with UIT ability tests.

[1] Although not summarized here, Arthur et al. (2009) also present data on the distributional placement (i.e., position in the score distribution) of malfeasant responders on both the cognitive ability and the personality measures. These data are available upon request or can be obtained from the article.
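To make the band rule concrete, the following is a minimal Python sketch of a 1 SEM_d classification of this kind. It is not Arthur et al.'s actual procedure: the sketch assumes the SEM of the difference scores is formed from the test's standard deviation and retest reliability as SEM_d = SD * sqrt(2(1 - r_xx)), and all function names and example values are illustrative.

    import math

    def sem_of_difference(sd: float, retest_r: float) -> float:
        """SEM of a retest difference score, assuming equal SDs and
        reliabilities at Time 1 and Time 2: SEM_d = SD * sqrt(2(1 - r_xx))."""
        return sd * math.sqrt(2.0 * (1.0 - retest_r))

    def classify(time1: float, time2: float, sem_d: float) -> str:
        """Place the difference score (Time 2 - Time 1) relative to a
        band of +/- 1 SEM_d around zero."""
        diff = time2 - time1
        if diff > sem_d:     # gain exceeds the band: practice effect
            return "evidence of practice effect"
        if diff < -sem_d:    # drop below the band: possible malfeasance at Time 1
            return "evidence of cheating"
        return "evidence of stability"

    # Illustrative values only, not the operational test's statistics.
    band = sem_of_difference(sd=10.0, retest_r=0.78)
    for t1, t2 in [(50.0, 62.0), (55.0, 54.0), (60.0, 47.0)]:
        print(f"{t1:.0f} -> {t2:.0f}: {classify(t1, t2, band)}")

On the cognitive ability test, this three-way split corresponds to the practice/stability/cheating categories in Figure 1; applied to the personality scores, the same kind of band yields the distortion/stability categories in Figure 2.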

Table 1. Comparison of Arthur et al.'s (2009) Study 1 Cognitive Ability (Unproctored) Test–Retest Standardized Mean Differences (d) and Retest Reliability Coefficients With Hausknecht et al.'s (2007) Meta-Analytic Results

                         Arthur et al. Study 1 (unproctored)   Hausknecht et al. (proctored)
                         Total      HL         LL              Op(a)      Rsch(b)
Time 2 - Time 1 d        0.39*      0.36*      0.57*           0.27       0.22
Retest reliability       .78        .77        .84             .82(c)

Note. N = 296; HL n = 252; LL n = 44. ds were computed by subtracting the Time 1 scores from the Time 2 scores, so a positive d indicates that the Time 2 score is greater than the Time 1 score.
(a) Operational data. (b) Research-based data. (c) Hausknecht et al. do not present an operational/research retest reliability breakdown; the reliability estimate is therefore for both settings combined. For comparative purposes, Arthur et al.'s mean retest interval was 429.16 days (SD = 54.84, median = 419.50); in contrast, Hausknecht et al. report a mean of 134.52 days (SD = 304.67, median = 20.00).
*p < .05, two-tailed.

Table 2. Comparison of Arthur et al.'s (2009) Study 1 and Study 2 Personality Factor (Unproctored) Test–Retest Standardized Mean Differences (d) and Retest Reliability Coefficients With Birkeland et al.'s (2006) Meta-Analytic Results

                      Arthur et al. Study 1 (unproctored)      Study 2 (unproctored)   Birkeland et al. (proctored)
FFM dimension         Total         HL            LL           HL                      Direct d   Indirect d
                      d      r_xx   d      r_xx   d      r_xx  d      r_xx
Agreeableness         -0.63* .53    -0.66* .53    -0.43* .56   -0.47* .51              -0.51      0.15
Conscientiousness     -0.66* .53    -0.72* .55    -0.33* .49   -0.22* .62              -0.79      -0.15
Emotional stability   -0.63* .49    -0.72* .51    -0.11  .50   -0.54* .55              -0.72      -0.24
Extraversion          -0.29* .73    -0.31* .70    -0.19* .87   -0.14  .78              -0.18      -0.07
Openness              -0.20* .67    -0.26* .66    -0.11  .75   0.16   .61              -0.28      -0.02

Note. Study 1, N = 296 (HL n = 252, LL n = 44); Study 2, N = 318. ds were computed by subtracting the Time 1 scores from the Time 2 scores, so a positive d indicates that the Time 2 score is greater than the Time 1 score. The mean retest interval was 429.16 days for Study 1 (SD = 54.84, median = 419.50) and 306.98 days for Study 2 (SD = 35.84, median = 305.00).
*p < .05, two-tailed.
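As a reading aid for the entries in Tables 1 and 2, the following Python sketch shows how the two statistics could be computed from paired Time 1/Time 2 scores. The table notes fix only the direction of d (Time 2 minus Time 1); the pooled-SD standardizer and the use of a Pearson correlation for the retest reliability are assumptions of ours, as are the function names.

    import statistics as stats

    def retest_d(time1: list[float], time2: list[float]) -> float:
        """Standardized mean retest change, Time 2 - Time 1, so a positive d
        means scores rose on retesting (and a negative d, as in Table 2,
        means the high-stakes Time 1 scores were higher). The pooled-SD
        standardizer is an assumption; within-subjects variants sometimes
        divide by the SD of the difference scores instead."""
        mean_change = stats.mean(time2) - stats.mean(time1)
        pooled_sd = ((stats.variance(time1) + stats.variance(time2)) / 2) ** 0.5
        return mean_change / pooled_sd

    def retest_reliability(time1: list[float], time2: list[float]) -> float:
        """Retest reliability as the Pearson correlation between the Time 1
        and Time 2 administrations (statistics.correlation, Python 3.10+)."""
        return stats.correlation(time1, time2)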

[Figure 1. Percentage of Arthur et al.'s (2009) Study 1 test takers as per the 1 SEM_d operationalization of practice and cheating effects on the cognitive ability test: evidence of a practice effect, 49.32%; evidence of stability, 42.91%; evidence of cheating, 7.77%.]

Second, because proctoring is not a technique intended to prevent or minimize response distortion on noncognitive measures, it was expected that UIT personality measures would display levels of response distortion similar to those reported for proctored measures in the extant literature (e.g., Birkeland et al., 2006). The results of both of Arthur et al.'s (2009) studies supported this supposition. Thus, like scores on proctored measures, FFM dimension scores were generally higher in the high-stakes condition than in the low-stakes condition. In addition, the retest reliability coefficients were similar in magnitude and range to those reported for proctored tests using similar designs (e.g., see Ellingson, Sackett, & Connelly, 2007, Table 5; Hogan, Barrett, & Hogan, 2007, Tables 2 & 3). Furthermore, as with proctored tests, the magnitude of the score shifts was generally larger for Agreeableness, Conscientiousness, and Emotional Stability than for Extraversion and Openness. However, it should be noted that, compared to Birkeland et al.'s results for indirect measures, the retest effects in Arthur et al.'s data tended to be slightly larger, a pattern that is consistent with the finding (see Edens & Arthur, 2000; Viswesvaran & Ones, 1999) that larger response distortion effects are generally obtained with within-subject designs (Arthur et al.'s data) than with between-subject designs (Birkeland et al.'s data). Finally, the percentages of individuals identified as having elevated applicant scores were similar to those reported by Griffith, Chmielowski, and Yoshita (2007) for proctored tests, that is, 30–50%.

In summary, on the basis of Arthur et al.'s (2009) data, one could conclude that UIT personality measures display mean score shifts between high- versus low-stakes testing conditions that are similar to those reported for proctored personality measures. In contrast, we suspect that if the cognitive ability test used in Arthur et al.'s study had not been speeded, thereby increasing opportunities to engage in malfeasant behaviors, it would have displayed high-stakes test scores higher than those reported for proctored ability tests. However, it would seem that the use of a speeded test resulted in retest effects similar to those reported for proctored ability tests in the extant literature.

Conclusions

The efficacy of UIT employment testing is threatened by the possible influence of malfeasance on the part of applicants. However, we review the results of a study suggesting that even under conditions where it is intuitive to expect widespread cheating and malfeasance (i.e., high-stakes unproctored ability testing), there was no discernible effect on test scores when the testing condition was designed to counter such behavior. The UIT speeded test format appears to have reduced the opportunity for, and thus the prevalence of, malfeasant behavior. However, although speeded tests may curtail cheating-related behaviors that are temporally resource demanding, they may do little to preempt the use of surrogate test takers.

[Figure 2. Percentage of Arthur et al.'s (2009) Study 1 and Study 2 test takers as per the 1 SEM_d operationalization of response distortion on the personality measures: for each FFM dimension (Agreeableness, Conscientiousness, Emotional Stability, Extraversion, and Openness), test takers were classified as showing evidence of response distortion (low), evidence of stability, or evidence of response distortion (high).]

Nevertheless, where appropriate, the use of speeded UIT ability tests may be an additional option or an alternative to on-site retesting. Similarly, Arthur et al.'s results suggest that a UIT administration does not uniquely threaten personality measures in terms of levels of response distortion higher than those observed for proctored tests. Specifically, the pattern of high-stakes versus low-stakes retest effects observed for UIT and proctored personality measures was quite similar.

References

Arthur, W., Jr., Glaze, R. M., Villado, A. J., & Taylor, J. E. (2009). The magnitude and extent of cheating and response distortion effects on unproctored Internet-based tests of cognitive ability and personality. Manuscript submitted for publication.

Birkeland, S. A., Manson, T. M., Kisamore, J. L., Brannick, M. T., & Smith, M. A. (2006). A meta-analytic investigation of job applicant faking on personality measures. International Journal of Selection and Assessment, 14, 317–335.

Cizek, G. J. (1999). Cheating on tests: How to do it, detect it, and prevent it. Mahwah, NJ: Lawrence Erlbaum Associates.

Edens, P. S., & Arthur, W., Jr. (2000). A meta-analysis investigating the susceptibility of self-report inventories to distortion. Paper presented at the 15th annual conference of the Society for Industrial and Organizational Psychology, New Orleans, LA.

Ellingson, J. E., Sackett, P. R., & Connelly, B. S. (2007). Personality assessment across selection and development contexts: Insights into response distortion. Journal of Applied Psychology, 92, 386–395.

Griffith, R. L., Chmielowski, T., & Yoshita, Y. (2007). Do applicants fake? An examination of the frequency of applicant faking behavior. Personnel Review, 36, 341–355.

Hausknecht, J. P., Halpert, J. A., Di Paolo, N. T., & Moriarty-Gerrard, M. O. (2007). Retesting in selection: A meta-analysis of coaching and practice effects for tests of cognitive ability. Journal of Applied Psychology, 92, 373–385.

Hogan, J., Barrett, P., & Hogan, R. (2007). Personality measurement, faking, and employment selection. Journal of Applied Psychology, 92, 1270–1285.

Hough, L. M. (1998). Effects of intentional distortion in personality measurement and evaluation of suggested palliatives. Human Performance, 11, 209–244.

Nye, C. D., Do, B., Drasgow, F., & Fine, S. (2008). Two-step testing in employee selection: Is score inflation a problem? International Journal of Selection and Assessment, 16, 112–120.

Tippins, N. T. (2009). Internet alternatives to traditional proctored testing: Where are we now? Industrial and Organizational Psychology: Perspectives on Science and Practice, 2, 2–10.

Viswesvaran, C., & Ones, D. S. (1999). Meta-analyses of fakability estimates: Implications for personality measurement. Educational and Psychological Measurement, 59, 197–210.

Whitley, B. E. (1998). Factors associated with cheating among college students: A review. Research in Higher Education, 39, 235–274.