BACKGROUND CHARACTERISTICS OF EXAMINEES SHOWING UNUSUAL TEST BEHAVIOR ON THE GRADUATE RECORD EXAMINATIONS


Philip K. Oltman

GRE Board Professional Report GREB No. 82-8P
ETS Research Report
December 1985

This report presents the findings of a research project funded by and carried out under the auspices of the Graduate Record Examinations Board.

Copyright by Educational Testing Service. All rights reserved.

Abstract

We ordinarily expect item difficulty to be related to errors on tests; examinees generally tend to make errors on more difficult items and to answer easier items correctly. However, some examinees miss easy items and get more difficult items correct. The extent to which correct and incorrect responses are predicted by the difficulty of test items has been quantified in various ways. One method, originated by Sato (1975) and modified by Harnisch and Linn (1981), was used in the present study of item level data from the Graduate Record Examinations General Test. The modified Sato caution index was found to be of low reliability in these data and to be generally unrelated to a variety of background variables, although ethnic group showed a small but significant relation to the index. In these data the modified caution index showed a curvilinear relation to total test score, with examinees who showed very high or very low scores having higher index values than those in the middle range of test scores. Indexes calculated on the three sections of the test were uncorrelated with each other. Finally, the index did not moderate the relationship between test scores and self-reported grades, which it should have done if it indeed indicates how well the test measures the intended construct for any individual. The conclusion reached is that the modified caution index adds little information that would be of value in interpreting GRE test scores.

Introduction

An examinee taking a test may achieve a given number of correct responses in a variety of ways. To cite the example given by Harnisch and Linn (1981), it is possible to achieve a score of 10 correct answers on a 20-item test in 184,756 different ways, that is, by answering 184,756 different patterns of correct and incorrect responses. While the variations among most of these patterns probably do not make much substantive difference, in other cases they might be important. If one examinee's 10 correct answers were the easiest 10 items, and another's were the most difficult 10, one might hesitate to assert that the two scores of 10 mean the same thing. Admittedly, this large a difference in patterns would be a rare occurrence, but the example serves to illustrate the point that patterns of correct and incorrect responses may have information in them beyond what is provided by the total score. The total score, for example, would not tell us anything about patterns of strengths and weaknesses across different sections of the test, nor would it uncover evidence that the preparation of some examinees differed markedly from others in emphasis on certain areas.

On the hypothetical 20-item test, given the 184,756 patterns of 10 correct responses and the astronomical number of patterns producing all the other possible total scores, a way of imposing some structure on the mass of potential information would be extremely useful. One way might be to search for clusters of examinees that show similar patterns of correct and incorrect responses. Another approach, which was followed in the present study, is to compare each obtained pattern of correct and incorrect responses with a benchmark pattern and calculate an index to indicate the extent of deviation of each pattern from that benchmark. A number of approaches have been developed using the item difficulties for the group as a template against which to compare each examinee's pattern of responses.
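The count of 184,756 cited above is the number of ways to choose which 10 of 20 items are answered correctly. The following is a minimal Python sketch of that arithmetic; the variable names are illustrative only and do not come from the report.

```python
from math import comb

# Number of distinct right/wrong patterns with exactly 10 correct on 20 items:
# choose which 10 of the 20 items are the correct ones.
patterns_for_10 = comb(20, 10)

# Number of right/wrong patterns on 20 items across all possible total scores.
all_patterns = 2 ** 20

print(patterns_for_10)
print(all_patterns)
```

Summing `comb(20, k)` over every possible total score `k` gives the same value as `all_patterns`, which is one way to check the count.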
Underlying these methods is the notion that an examinee behaving in a perfectly orderly way would achieve correct responses on all items up to some level of difficulty, and then would show incorrect responses on all items more difficult than that level. For example, in our hypothetical situation, an "ideal" examinee with a score of 10 would achieve correct responses on the 10 easiest items and would miss the 10 most difficult ones. Although this result would seldom be achieved exactly, most correct answers would be expected to come from the easier items and most errors would be expected to come from the more difficult items. The extent to which an examinee's pattern deviated from the "ideal" could well contain important information. At the very least, a tendency for an examinee to make errors on easy items and correct responses on difficult items would serve as a caution flag that the test score was produced by an unusual pattern of responses that may not be interpretable in the same terms as a score obtained from a pattern that closely matches the item difficulties. If a group of examinees deviates from the expected pattern, then the normative item difficulties may not apply to them, and one might question whether the test measures the intended construct.

Some of the methods developed to study unusual response patterns have been based on item response theory (e.g., Levine & Rubin, 1979), while others (e.g., Donlon & Fischer, 1968; van der Flier, 1977; Tatsuoka & Tatsuoka, 1980; Sato, 1975, described in English by Tatsuoka, 1978) have directly compared the pattern of correct and incorrect responses with the item difficulties to derive an index of deviation, or "caution" in Sato's terminology. After comparing several methods of computing indexes of deviation from the usual or expected pattern of responses, Harnisch and Linn (1981) suggested that a modification of Sato's caution index was preferred over the others they studied because it was least confounded with total score. We therefore selected Harnisch and Linn's modification of Sato's caution index to apply to item level data files in a study of one administration of the Graduate Record Examinations (GRE) General Test.

The modified caution index ranges from 0 to 1, with higher values indicating greater departure from the usual pattern of response. That is, an individual with a high modified caution index has missed some easy items and gotten some difficult items correct. "Usual" patterns of response are defined by the total number of correct items achieved by an examinee. For example, for examinees achieving 5 items correct, the usual pattern would entail that those items be the 5 easiest items; for examinees with 10 items correct, the 10 easiest; and so on.
If the n items marked correctly are not the n easiest items, then that pattern is unusual; the greater the difficulty of the n correct items, the more unusual the pattern becomes. The highest possible modified caution index value would be obtained by a pattern consisting of n items correct that were the most difficult n items. Further details of the calculations producing the modified caution index can be found in Harnisch and Linn (1981).

Our interest was in the correlates of unusual test behavior on the GRE test. Would it be possible to find some pattern in the information available in the background data collected during the GRE test administration that is characteristic of examinees showing unusual test responses? For example, it seemed possible that ethnic minority groups might differ from the majority to the extent that ethnic background produces differing cultural and educational experiences that would be expressed in test performance.

The modified caution index was computed for each examinee, separately for each score on the GRE General Test. The computations were carried out using a computer program designed for this purpose (Harnisch, Kuo, & Torres, 1982), and the resulting indexes were added to the data record of each examinee.

Calculation of Modified Caution Index

If each examinee's record consists of a row of 1's and 0's representing correct and incorrect responses, the data from the entire sample can be portrayed as a matrix, with rows representing examinees and columns representing items. To calculate the modified caution index, the columns are rearranged so that the column sums decrease from left to right (that is, the easiest items are toward the left). Similarly, the rows are rearranged so that the row sums decrease from top to bottom (that is, the examinees with higher scores are toward the top). Given this matrix, Sato's caution index, as modified by Harnisch and Linn (1981), is given by the following formula:

\[
\text{Caution Index} = \frac{\displaystyle\sum_{j=1}^{n_{i.}} (1 - u_{ij})\, n_{.j} \;-\; \sum_{j=n_{i.}+1}^{J} u_{ij}\, n_{.j}}{\displaystyle\sum_{j=1}^{n_{i.}} n_{.j} \;-\; \sum_{j=J+1-n_{i.}}^{J} n_{.j}}
\]

where
i = 1, 2, ..., I indexes the examinee,
j = 1, 2, ..., J indexes the item,
u_{ij} = 1 if examinee i answers item j correctly, and 0 if examinee i answers item j incorrectly,
n_{i.} = total correct for the ith examinee, and
n_{.j} = total number of correct responses to the jth item.
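The calculation described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the Harnisch, Kuo, and Torres (1982) program: the function name `modified_caution_index`, the tiny response matrix, and the choice to return `None` when the denominator is zero (for example, for an examinee with every item right or every item wrong) are all assumptions made for the example. Rows are not reordered here, since the index for one examinee depends only on the item ordering.

```python
def modified_caution_index(responses):
    """Return one modified caution index per examinee.

    responses: list of rows, each a list of 0/1 item scores (1 = correct).
    """
    n_items = len(responses[0])
    # n_.j: number of examinees answering item j correctly.
    item_totals = [sum(row[j] for row in responses) for j in range(n_items)]
    # Order items from easiest (largest total) to hardest.
    order = sorted(range(n_items), key=lambda j: -item_totals[j])
    totals_sorted = [item_totals[j] for j in order]

    indexes = []
    for row in responses:
        u = [row[j] for j in order]
        n_i = sum(u)  # n_i.: total correct for this examinee
        # Item totals of the n_i easiest items that were missed.
        missed_easy = sum((1 - u[j]) * totals_sorted[j] for j in range(n_i))
        # Item totals of the harder items that were answered correctly.
        passed_hard = sum(u[j] * totals_sorted[j] for j in range(n_i, n_items))
        # Normalizer: totals of the n_i easiest items minus the n_i hardest.
        denom = sum(totals_sorted[:n_i]) - sum(totals_sorted[n_items - n_i:])
        # Assumption: report None when the index is undefined (denominator 0).
        indexes.append((missed_easy - passed_hard) / denom if denom else None)
    return indexes


# Hypothetical data: six examinees, four items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
]
indexes = modified_caution_index(responses)
print(indexes)
```

In this made-up matrix the fifth examinee, who answers the two hardest items correctly while missing the easiest one, receives the highest index, while examinees whose correct answers are exactly their easiest items receive 0.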

Results

Score Means and Distributions of Sex, Ethnic Group, and Major

The means and standard deviations of scores on the verbal, quantitative, and analytical sections of the GRE General Test for each sample and for the population are shown in Table 1. Also shown in Table 1 are the distributions of sex, ethnic group, and undergraduate major for the samples and the population. From these data it is apparent that the samples accurately represent the population from which they were drawn. In none of these comparisons did the samples differ significantly from the population or from each other.

Characteristics of the Modified Sato Caution Index

The means and standard deviations of the modified caution indexes computed from the verbal, quantitative, and analytical scores on the test are shown in Table 2 for each sample. The samples did not differ from each other in means or standard deviations on any of the scores on the test. The means, ranging from .20 to .24, and the standard deviations, ranging from .06 to .08, are comparable to those reported by Harnisch and Linn (1981), although our data were from a general aptitude test taken by graduate-school-bound examinees and those of Harnisch and Linn were from an achievement test taken by fourth-graders. The distributions of the modified caution index for each of the three parts of the test are shown in Table 3. While the distributions of observed indexes are not markedly asymmetrical, they cluster in the lower range of possible values, with very few exceeding .50.

To assess reliability, pairs of indexes were computed for each examinee, one for the odd-numbered items in a test, and one for the even-numbered items. Spearman-Brown reliability estimates based on the correlations between the odd and even indexes were quite low, ranging between .15 and .29. Plots of odd indexes versus even indexes were examined, but nothing particularly unusual was found.
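The split-half procedure described above correlates the odd-item and even-item indexes and then steps that correlation up with the Spearman-Brown formula, r_full = 2r / (1 + r). The following Python sketch shows the two steps; the helper names `pearson` and `spearman_brown` are illustrative and are not taken from the report.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Step a half-test correlation up to a full-length reliability estimate."""
    return 2 * r_half / (1 + r_half)


# A half-test correlation of .50 steps up to about .67.
print(round(spearman_brown(0.50), 2))
```

Working the formula backward, stepped-up estimates of .15 and .29 correspond to odd-even correlations of roughly .08 and .17, which gives a sense of how weakly the two half-test indexes agreed.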
The plots had no extreme outliers and were otherwise unremarkable, except for the low relation between the two indexes. This apparent lack of reliability obviously makes it rather unlikely that the index will correlate with anything else.

Correlations between Indexes from Sections of the Test

Correlations were computed between the indexes calculated for each of the three measures: verbal, quantitative, and analytical. There was no evidence that unusual responding was a "trait," in the sense that it might cut across test sections. The correlations among the indexes from the three measures hovered around zero.

Correlates of the Modified Sato Caution Index

Correlations and multiple regression analyses were performed to explore the data for possible correlates of degree of unusual responding. In each analysis the relation between the caution index and a given variable was computed with total score on the test held constant. The aim was to use the background data to characterize examinees with varying degrees of departure from the "usual" pattern of test response. In every case, the results from the two samples were almost identical. To conserve space in what follows, only the results for the first sample will be described.

Background information questionnaire. None of the background information items was substantially related to the caution indexes calculated from the verbal, quantitative, or analytical item data. The largest correlation observed was a multiple correlation of .14 between the set of all ethnic groups and the caution index calculated from the verbal item responses. The largest difference accounting for the multiple correlation was between the Black and White examinees: the 101 Black examinees had a mean index for the verbal measure of .26 (SD = .08), while the 1,725 White examinees' mean index was .21 (SD = .06). This difference was statistically significant (p < .01) because of the large sample size. However, given the size of the difference, it would be difficult to claim much utility for the modified caution index in the interpretation of GRE General Test scores. The test scores of the Black examinees were generally lower than those of the White group, which may have had some effect on the caution indexes as well.
To test the mean caution index difference more directly, a sample of 101 White examinees was drawn to match the 101 Black examinees on test scores; three separate White samples were drawn to match on each of the three parts of the test. To carry out the matching, the distributions of Black examinees' scores were divided into six intervals, and White examinees were randomly drawn from each interval to match the number of Black examinees from that interval. The mean indexes did not differ significantly between the Black and matched White groups, suggesting that the observed mean differences in caution indexes in the total sample were due to differences in test score levels. When the test score levels were made equivalent for Black and White groups, the difference in caution index disappeared.

Scores on sections of the test. One of the reasons that Harnisch and Linn (1981) recommended the use of the modified Sato caution index was that it showed a very low correlation with test scores in their data. If a particular index were to show a substantial correlation with test scores, it would be difficult to use it as an independent source of information about test performance that goes beyond the total score. Over all examinees, we also found quite low correlations in our data. The linear correlations between the verbal, quantitative, and analytical scores and the modified caution indexes were -.16, .02, and -.10, respectively. However, when we calculated these correlations for each ethnic group separately, we found rather different results. Table 4 displays these results and shows that, while the correlations between test scores and index values for White examinees were indeed around zero, those for other groups were considerably higher. In particular, the Black examinees showed consistently negative correlations, indicating that the lower the score, the more unusual was the pattern of correct and incorrect responses.

Scatter plots of each test score plotted against its respective index value were examined to try to determine whether they might help explain why the correlations differed from one ethnic group to another. Curvilinearity was apparent, and Table 5 displays how much the multiple correlation was increased by adding a squared test score term to the regression. In each case the increase was significant. The shape of the curvilinearity suggested that examinees with extremely high or extremely low test scores had higher index values.

The observed curvilinearity probably accounts fully for the differences in correlations between test scores and indexes for the various ethnic groups. The case is most clear-cut for the Black examinee group, who scored significantly lower on each part of the test than did the White examinees. Given the curvilinearity in the scatter plots, any group scoring near the low end of the distribution would show a negative slope. Correlations were calculated between test scores and indexes for the matched White samples (described above) for the verbal, quantitative, and analytical scores.
These, along with the comparable correlations from the Black examinees' data, are displayed in Table 6, where it can be seen that the two sets of correlations are very similar. Thus it seems reasonable to conclude that the substantial negative correlations between test scores and caution index values for Black examinees can be attributed to the fact that the Black examinees' generally lower scores put that group on the descending arm of the U-shaped curve. As can be seen from the correlations calculated for the matched White samples, a group of White examinees with test scores similar to the Black group's scores showed similar correlations.
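The argument above, that a U-shaped relation produces a near-zero correlation over the full score range but a negative one for any group concentrated at the low end, can be illustrated with a small Python sketch. The scores and index values below are invented for illustration and are not GRE data; the quadratic form and its constants are assumptions chosen only to produce a symmetric U shape.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical U-shaped relation: the index is lowest for mid-range scores.
scores = list(range(20, 81, 5))  # 20, 25, ..., 80
index = [0.20 + 0.0001 * (s - 50) ** 2 for s in scores]

# Correlation over the whole score range.
r_all = pearson(scores, index)

# Correlation for a subgroup on the descending arm of the curve (low scores).
low = [(s, c) for s, c in zip(scores, index) if s < 50]
r_low = pearson([s for s, _ in low], [c for _, c in low])

print(round(r_all, 3), round(r_low, 3))
```

With these made-up values the full-range correlation is essentially zero while the low-scoring subgroup shows a strongly negative correlation, which mirrors the pattern described for the Black and matched White samples.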

The Caution Index as a Moderator

The implication of a high caution index is that the test score accompanying it is somehow difficult to interpret and is perhaps not a valid indicator of the underlying dimension we intend to estimate. The caution index would be important information to have available when interpreting a test score if it indeed gave such information about the validity of the score. Following this reasoning, one might expect that the caution index would moderate the relationship between test scores and various criteria they are intended to predict. That is, one might expect that grades would be better predicted by test scores for examinees with low caution indexes than for those with high caution indexes. If high caution indexes in a group of examinees indicate that the meaning of their test scores is not clear, then their grades and other criteria should not be predicted as well.

While subsequent actual grades were not available to test this interpretation of the caution index, the examinees did provide self-reports of their undergraduate grades. Correlations between these self-reports of grades and scores on the three measures of the General Test are shown in Table 7 for the total group, for those "low caution" examinees who had indexes in the bottom 5 percent of the distribution, and for "high caution" examinees who were in the top 5 percent on the index. In each case, the correlations were higher for the "high caution" group, which is the reverse of what was expected. Regressions were compared for the high and low groups to check the possibility that differences in variances might have produced the differing correlations shown in Table 7, but the regressions showed the same pattern as the correlations.
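The moderator comparison described above amounts to ranking examinees on the caution index, taking the lowest and highest slices, and correlating test scores with grades within each slice. The Python sketch below shows one way this could be set up. The function name `moderator_check`, the record layout, the minimum group size of three, and the sample records are all assumptions for illustration; the report itself used the bottom and top 5 percent of a much larger sample.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def moderator_check(records, fraction=0.05):
    """Correlate score with grade overall and within low/high caution groups.

    records: list of (caution_index, test_score, grade) tuples.
    """
    ranked = sorted(records, key=lambda r: r[0])
    # Assumption: keep at least 3 examinees per group so r is not trivial.
    k = max(3, int(len(ranked) * fraction))

    def score_grade_r(group):
        return pearson([r[1] for r in group], [r[2] for r in group])

    return {
        "total": score_grade_r(ranked),
        "low_caution": score_grade_r(ranked[:k]),
        "high_caution": score_grade_r(ranked[-k:]),
    }


# Hypothetical records: (caution index, test score, self-reported grade).
records = [
    (0.05, 400, 2.8), (0.06, 500, 3.2), (0.07, 600, 3.6),
    (0.20, 450, 3.0), (0.22, 550, 3.3), (0.25, 500, 3.1),
    (0.28, 650, 3.5), (0.30, 350, 2.7), (0.35, 600, 3.2),
    (0.50, 400, 3.5), (0.55, 500, 3.0), (0.60, 600, 3.4),
]
result = moderator_check(records, fraction=0.25)
print(result)
```

These invented records were built so that the low caution group shows the stronger score-grade relation, which is the pattern the moderator hypothesis predicts; the GRE data reported above showed the opposite.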
Discussion

The aim of the study was to explore some of the correlates of unusual patterns of test response on the GRE General Test using the modified Sato caution index as an indicator of the extent to which a pattern of correct and incorrect responses could be considered unusual. Given the available background data, a search was conducted for evidence that would bear on the potential usefulness of the index in the interpretation of GRE General Test scores. Generally, the conclusion to be drawn from these analyses is that the index is of limited usefulness in this particular context.

Perhaps the most problematical finding was the low reliability of the index. Furthermore, indexes calculated for each of the three sections of the test were uncorrelated with each other. Thus it is not possible to interpret unusual response as a "trait." That is, index values apparently do not indicate anything general about an examinee's approach across tests in general, or even across sections of a test. If indexes are interpretable, such interpretation would pertain only to the particular part of the test on which the index was calculated, at least in these data.

The means and distributions of the indexes we calculated were comparable to those reported by Harnisch and Linn (1981). However, we found more curvilinearity in the relationship between test score and index than was expected from Harnisch and Linn's report. There was a significant tendency for high index values to be associated with extreme test scores, either high or low.

The description of the index suggested that it would have considerable utility as a moderator variable. If indeed a high index value indicated that interpretation of its associated score must be made with caution, and a low index value indicated that the score was a straightforward reflection of the underlying construct, then correlations between test scores and criteria should be higher for examinees with low indexes than for those with high indexes. High caution implies doubts about validity. No evidence supporting such a use for the index was found in these data. Correlations between test scores and self-reported grades did not differ in the expected direction between high- and low-index groups.

Perhaps the caution index behaves differently in aptitude test data than it does in achievement test data. Little evidence was found in the GRE data of consistent individual differences in extent of unusual responding or of a strong source of variance independent of the test scores. Achievement tests that have a closer relation between instruction and assessment may be more likely to contain sources of unusual response pattern variance that would be usefully related to individual examinee characteristics.
In summary, none of the analyses that were performed provided evidence that examinees showing unusual patterns of test response on the GRE General Test, as indicated by the modified Sato caution index, had unique background characteristics. The caution index did not provide information that would be useful to users of the GRE General Test. While there may indeed be information in the pattern of test responses beyond what is provided by the total score, this particular index calculated from these data did not prove to be informative.

References

Donlon, T. F., & Fischer, F. E. (1968). An index of an individual's agreement with group-determined item difficulties. Educational and Psychological Measurement, 28.

Harnisch, D. L., Kuo, S., & Torres, R. T. (1982). SPP: Student Problem Package (Version 1.0). Champaign, IL: University of Illinois Office of Educational Testing, Research, and Service.

Harnisch, D. L., & Linn, R. L. (1981). Analysis of item response patterns: Questionable test data and dissimilar curriculum practices. Journal of Educational Measurement, 18.

Levine, M. V., & Rubin, D. B. (1979). Measuring the appropriateness of multiple choice test scores. Journal of Educational Statistics, 5.

Sato, T. (1975). [The construction and interpretation of S-P tables.] Tokyo: Meiji Tosho.

Tatsuoka, M. M. (1978). Recent psychometric developments in Japan: Engineers grapple with educational measurement problems. Paper presented at the ONR Contractors Meeting on Individualized Measurement, Columbia, MO.

Tatsuoka, K., & Tatsuoka, M. M. (1980). Detection of aberrant response patterns and their effects on dimensionality (Research Report 80-4). Urbana, IL: University of Illinois, Computer-Based Education Research Laboratory.

van der Flier, H. (1977). Environmental factors and deviant response patterns. In Y. H. Poortinga (Ed.), Basic problems in cross-cultural psychology. Amsterdam: Swets and Zeitlinger, B.V.

Table 1
Means, Standard Deviations, and Distributions of Major Descriptive Variables

Columns: Sample A (a), Sample B (a), Population (b)
GRE Scores (M, SD): Verbal; Quantitative; Analytical
Distributions (percent):
  Sex: Female; Male
  Ethnic Group: Amerindian; Black; Chicano; Oriental; Puerto Rican; Other Hispanic; White; Other
  Undergraduate Major: Humanities; Social Sciences; Biological Sciences; Physical Sciences

a n = 1,994.
b N = 57,814.

Table 2
Means and Standard Deviations of Modified Caution Indexes Calculated on Verbal, Quantitative, and Analytical Measures of the GRE General Test

Columns: Sample A (M, SD), Sample B (M, SD)
GRE Measures: Verbal; Quantitative; Analytical

Table 3
Distributions of Modified Caution Index Calculated from the Verbal, Quantitative, and Analytical Measures of the GRE General Test

Columns: Verbal (Samples A, B); Quantitative (Samples A, B); Analytical (Samples A, B)
Rows: Percentiles (a)

a Scores in body of table are at the percentiles indicated by the row headings for the measures indicated by the column headings.

Table 4
Correlations between GRE Verbal, Quantitative, and Analytical Scores and Modified Caution Index on Corresponding Measures by Ethnic Group

Columns: N; Verbal; Quantitative; Analytical
Ethnic Group: Amerindian; Black; Chicano; Oriental; Puerto Rican a 26; Hispanic, Other; White; Other

Table 5
Tests of Curvilinearity of Regression of Caution Index on Test Scores

Columns: Verbal; Quantitative; Analytical
Type of Correlation with Test Score: Linear (a); Curvilinear (b)

a Test score versus caution index.
b Multiple correlation, test score, and test score squared versus caution index.

Table 6
Correlations Between Test Scores and Caution Indexes for Black and for White Examinees Matched on Test Scores

Columns: Verbal; Quantitative; Analytical
Group: Black
Group: White Matched: (a); -.39 (b); -.24 (c)

a White group matched on verbal score.
b White group matched on quantitative score.
c White group matched on analytical score.

Table 7
Correlations Between Self-Reported Grades and Test Scores for Different Levels of the Caution Index

Columns: Verbal; Quantitative; Analytical
Groups: Total Sample; "Low Caution" (a) .11; "High Caution" (b)

a Lower 5 percent on caution index.
b Upper 5 percent on caution index.
