Variations in Mean Response Times for Questions on the Computer-Adaptive General Test: Implications for Fair Assessment

Variations in Mean Response Times for Questions on the Computer-Adaptive General Test: Implications for Fair Assessment

Brent Bridgeman
Frederick Cline

GRE No. 96-2P

June 2000

This report presents the findings of a research project funded by and carried out under the auspices of the Graduate Record Examinations Board.

Educational Testing Service, Princeton, NJ 08541

Researchers are encouraged to express freely their professional judgment. Therefore, points of view or opinions stated in Graduate Record Examinations Board reports do not necessarily represent official Graduate Record Examinations Board position or policy.

The Graduate Record Examinations Board and Educational Testing Service are dedicated to the principle of equal opportunity, and their programs, services, and employment policies are guided by that principle.

EDUCATIONAL TESTING SERVICE, ETS, the ETS logo, GRADUATE RECORD EXAMINATIONS, and GRE are registered trademarks of Educational Testing Service.

Copyright 2000 by Educational Testing Service. All rights reserved.

Abstract

In a computer adaptive test (CAT), different examinees receive different sets of questions. Questions at the same overall difficulty level and meeting roughly the same content specifications could vary substantially in the amount of time needed to answer them. If the CAT is speeded (that is, if substantial numbers of students either do not finish or must guess randomly at the end to finish), individuals who happened to get a disproportionate number of questions that took a long time to answer could be disadvantaged. The purpose of this study was to determine whether--in computer-adaptive testing situations--the administration of a disproportionate number of questions with long expected response times unfairly disadvantages examinees. Data from 5,957 examinees who took the computer-delivered Graduate Record Examination (GRE) quantitative measure and 4,745 examinees who took the computer-delivered GRE analytical measure were used to investigate variation in response time in light of other factors, such as mean test score and the position of the question on the test. For both measures, substantial variation in response times was found, even for items with the same specifications and same difficulty level. But despite these differences, there was no indication that the scores of students who were administered items with long expected response times were disadvantaged.

Key words: test fairness, validity, speededness, computer-adaptive tests

Contents

Introduction
Study 1
  Method
  Results
Study 2
  Method
  Results
Conclusion
References

List of Tables

Table 1. Number of GRE Quantitative Items in Response-Time Categories by Item Type
Table 2. Number of Easy GRE Quantitative Items in Response-Time Categories by Item Type
Table 3. Number of Medium-Difficulty GRE Quantitative Items in Response-Time Categories by Item Type
Table 4. Number of Difficult GRE Quantitative Items in Response-Time Categories by Item Type
Table 5. Comparison of Two Category-10 Items
Table 6. Comparison of Two Category-2 Items
Table 7. Test-Developer Ratings of Response Times for Seven Category-10 Items
Table 8. Means and Standard Deviations for Examinees Taking Item Sets in Group A, Positions 5-9 and Positions 25-29
Table 9. Response Times for Group A and Group E Sets in Positions 5-9 and of Comparable Difficulty
Table 10. Number of Logical Reasoning Items in Response-Time Categories by Position and Level of Difficulty

List of Figures

Figure 1. Scatterplot of item position by mean response time
Figure 2. Scatterplot of item difficulty by mean response time

Introduction

Tests used for admissions to programs of professional or graduate education, such as the Graduate Record Examination General Test and Subject Tests, are generally designed to be power tests rather than speed tests. That is, they are intended to be tests of ability rather than of how quickly a student can answer. According to the GRE Technical Manual (Briel, O'Neill, & Scheuneman, 1993), the purpose of the GRE General Test is to assess "reasoning skills considered fundamental in graduate study: verbal reasoning, quantitative reasoning, and analytical reasoning" (p. 7), and the GRE General and Subject Tests "are not intended to be speeded" (p. 32). Nevertheless, the tests have strict time limits. Our preliminary analyses showed that more than 20% of examinees fail to finish the quantitative section of the General Test, and over 35% fail to finish the analytical section. Even examinees who answer every question may still be negatively impacted by the time limit; they may engage in rapid-guessing behavior near the end of the test just so they can answer every question. Such rapid-guessing behavior has been observed on the Test of English as a Foreign Language (Bejar, 1985; Yamamoto, 1995) as well as on a computer-administered, but nonadaptive, version of the GRE quantitative test (Schnipke & Scrams, 1997).

Although the addition of a speed component to a power test can be problematic for any type of test, additional unique considerations arise with a computer-adaptive test (CAT). The unidimensional models used with CATs implicitly assume that only knowledge of the correct answer, and not speed, is needed for a high score (Hambleton & Swaminathan, 1985). Although this is also true of paper-based power tests, the additional complication with a CAT is that each examinee is getting a different set of questions. For example, on the paper-based version of the GRE, post-administration equating procedures could provide a correction if questions on one form could be answered more quickly than questions on a different form. Although the item response theory (IRT) model used to score the computer-adaptive version of the General Test takes difficulty differences into consideration, it does not adjust for possible differences in the speed demands of different questions or sets of questions. Theoretically, GRE test questions at the same overall difficulty level or meeting roughly the same content specifications could vary substantially in the amount of time needed to answer them. Preliminary results from the experimental, computer-adaptive Graduate Management Admissions Test (GMAT) suggest that some quantitative question types take twice as long, on average, to answer as other quantitative question types (Bridgeman, Anderson, & Wightman, 1997). In this case, average

response times across question types varied from a low of 65 seconds for geometry data-sufficiency questions to a high of 135 seconds for algebra problem-solving questions, with standard deviations within question types ranging from 35 to 89 seconds. Variations in mean solution time across question types provide useful information about the cognitive processing demands of different question types, but by themselves they do not raise fairness concerns, as long as the test specifications standardize the number of questions of each type that an individual will receive. On the other hand, substantial variation within question type could signify an equity problem. Because the verbal, quantitative, and analytical sections of the GRE General Test are speeded--that is, substantial numbers of students either do not finish or must guess randomly at the end to finish--individuals who happen to get a disproportionate number of questions that take a long time to answer could be disadvantaged. Although the GMAT analysis suggests that there might indeed be reasons for concern, that study did not explore the data in enough depth to pinpoint the nature and extent of this potential problem. In particular, the analysis did not control for question difficulty within question type, so it is possible that within-question-type variability could be largely explained by difficulty differences. In the current study, we investigated mean response times on GRE General Test questions while controlling for question difficulty.

Two separate analyses are reported here. In Study 1, we examine CAT items from the GRE quantitative measure. Because 24 of the 28 questions in the GRE quantitative test are discrete questions, rather than being part of question sets, analyses for this measure were less complex and provided a useful starting point. In Study 2, we look at CAT items from the GRE analytical measure, which makes extensive use of question sets (that is, a series of questions based on a single problem presentation).

Study 1

Method

The data for the GRE quantitative questions came from a 1997 item pool in which 5,957 examinees took CAT versions of the General Test. This particular item pool was administered after a procedure called proportional adjustment for incomplete tests, which imposes a penalty for leaving questions unanswered, was instituted. Therefore, candidates were motivated to complete the test. Examinees had 45 minutes to try to answer 28 questions, or about 96 seconds per question. We extracted mean response times for each item in each position in the test in which it was administered. For example,

question 2 might be administered as the third item in one person's test and as the 24th item in another person's test, so we computed separate means for each position. If time ran out for an examinee before the last question attempted was answered, we excluded that examinee's time from the computation of the mean time for that question. We also excluded all items that were part of sets (each examinee who finished the test responded to two sets [two questions per set] and 24 discrete questions). The item pool consisted of 252 discrete questions. In addition to mean times for all examinees who were administered an item, we computed mean times for those who chose the correct answer to the question and for those who chose an incorrect answer. We also computed mean GRE quantitative scores for examinees who got the question right and for those who got the question wrong.

For each item, the database contained the three IRT parameters: discrimination (a), difficulty (b), and pseudo-guessing (c). The b parameter expresses difficulty as a standard score with a mean of 0 and a standard deviation of 1. We classified items into five difficulty categories as follows:

1. very easy (b < -1.5)
2. easy (-1.5 ≤ b < -0.5)
3. medium (-0.5 < b ≤ 0.5)
4. difficult (0.5 < b ≤ 1.5)
5. very difficult (b > 1.5)
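To make this classification rule concrete, the short sketch below assigns items to the five difficulty categories from their b parameters. It is only an illustration: Python with the pandas library is assumed, and the item identifiers and b values are invented rather than taken from the study data.

import pandas as pd

# Hypothetical item records; the IRT b (difficulty) values are invented for illustration.
items = pd.DataFrame({
    "item_id": [101, 102, 103, 104, 105],
    "b": [-1.8, -0.9, 0.1, 0.8, 1.7],
})

# Difficulty bands as described in the text: very easy, easy, medium, difficult, very difficult.
bins = [-float("inf"), -1.5, -0.5, 0.5, 1.5, float("inf")]
labels = ["very easy", "easy", "medium", "difficult", "very difficult"]
# pd.cut uses right-closed intervals; the report's exact boundary conventions differ slightly.
items["difficulty_category"] = pd.cut(items["b"], bins=bins, labels=labels)

print(items)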

Items had been classified by test developers into 16 content categories, which the automated item-selection algorithm uses to make certain that CATs for all individuals are comparable on these dimensions. The first category level classifies items as either problem solving (PS) questions or quantitative comparison (QC) questions. The PS items are standard multiple-choice questions, each of which offers five answer choices. The QC items, which offer four answer choices, ask the examinee to determine which of two quantities is larger, whether they are equal, or whether there is not enough information to determine the answer. (Each examinee's test contained 10 PS questions and 14 QC questions, plus the four items from two sets that were excluded from the current analyses.)

Interpreting means for skewed distributions can be problematic, and response time distributions tend to be positively skewed. However, the skewness for these items was generally not too severe because examinees knew that spending too much time on any question would impede their ability to finish the test. Thus, means were adequate for our purposes, and we have also supplied medians for certain analyses.

All questions are also categorized as either "pure" or "real." Pure questions deal only with numbers and symbols, while real questions refer to a name or object from the real world and are frequently word problems. The test specifications indicate that each examinee's test should contain 18 pure items and six real ones. All items are further coded into four subject-matter categories: arithmetic (eight or nine items per examinee), algebra (six or seven items), geometry (five or six items), and data interpretation (two to five items). We assigned numerical codes to these 16 categories as follows:

1. QC, pure, arithmetic
2. QC, pure, algebra
3. QC, pure, geometry
4. QC, pure, data interpretation
5. QC, real, arithmetic
6. QC, real, algebra
7. QC, real, geometry
8. QC, real, data interpretation
9. PS, pure, arithmetic
10. PS, pure, algebra
11. PS, pure, geometry
12. PS, pure, data interpretation
13. PS, real, arithmetic
14. PS, real, algebra
15. PS, real, geometry
16. PS, real, data interpretation

Beyond these 16 categories, items are further classified into one of 79 categories that provide a more detailed description of the item content--such as negative exponents, linear inequality, and ratio and proportion. These categories are not used to select items for inclusion on individual CATs; also, there are more of these item-content categories than there are items on any one examinee's test. However, a few of these categories are used to provide the upper limit of the number of questions of a given content type that should be included on each test; for example, no individual's test should contain more than two ratio and proportion questions, though it need not contain any items in this category at all.

Results

Position effects. We first looked at the relationship between position in the test and mean latencies. If each of the 252 items appeared in each of the 28 possible positions, there would be 7,056 means representing all of the possible item-by-position combinations. In fact, however, the operation of the item selection algorithm is such that a given item actually appears in only a limited number of the 28 positions. In this data set, ,76 means represented all of the data points. For this and other analyses described here, we required that a mean be based on the performance of at least 20 students; this brought the total number of means to ,3. With this constraint, the correlation between mean time to complete the item and item position was not significant (r = -.04). This lack of relationship between position and completion time can be seen in Figure 1. Similar nonsignificant correlations were found for analyses run separately for the PS and QC item types. Furthermore, mean time to correct solution was almost the same for the early PS items (positions 1-5) as it was for PS items in the last two positions (100 seconds vs. 99 seconds, with standard deviations of 74 and 52, respectively). These results do not imply that individual examinees are necessarily all working at a uniform rate, but only that examinees who are responding rapidly at the end are to some extent balanced by examinees who are responding more slowly at the end. Indeed, Scrams and Schnipke (1999) suggested that about 23% of examinees speeded up as they proceeded through a linear version of the GRE quantitative test, and about 22% slowed down, with the remainder keeping a relatively even pace throughout.

Difficulty. As suggested in Figure 2, a correlation of .44 (or .56 including a b^2 term to reflect the curvilinear increase) was observed between mean time and difficulty (b). Despite this generally positive relationship between time to answer and difficulty, four of the easiest items (b < -1.5) took over 100 seconds, on average, to answer, and four of the hardest items were answered in less than 80 seconds.
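The following sketch illustrates the two analyses just described--the correlation of mean time with item position and the linear versus curvilinear (b plus b^2) relationship of mean time to difficulty. Python with numpy and scipy is assumed, and the item-by-position means are simulated, so the coefficients it prints are not the study values.

import numpy as np
from scipy.stats import pearsonr

# Simulated item-by-position means; in the study each mean was required to be
# based on at least 20 examinees.
rng = np.random.default_rng(0)
n = 300
position = rng.integers(1, 29, size=n)                          # positions 1-28 in the CAT
b = rng.normal(0.0, 1.0, size=n)                                 # IRT difficulty
mean_time = 90 + 25 * b + 8 * b**2 + rng.normal(0, 20, size=n)   # seconds (simulated)

# Correlation of mean response time with position (reported in the study as near zero).
r_pos, p_pos = pearsonr(position, mean_time)
print("r(position, mean time) =", round(float(r_pos), 3), "p =", round(float(p_pos), 3))

# Linear and curvilinear (b plus b^2) fits of mean time on difficulty.
X_lin = np.column_stack([np.ones(n), b])
X_quad = np.column_stack([np.ones(n), b, b**2])
for name, X in [("linear", X_lin), ("quadratic", X_quad)]:
    beta, *_ = np.linalg.lstsq(X, mean_time, rcond=None)
    pred = X @ beta
    r_multiple = np.corrcoef(pred, mean_time)[0, 1]
    print(name, "multiple R =", round(float(r_multiple), 3))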

Item type. Table 1 shows the relationship of item type to mean response time for 244 test items. (Because eight of the 252 questions in the item pool were seen by fewer than 20 examinees, they were excluded from the analyses.) Item types 1-8 (QC items), which are designed to be answered quickly, generally did appear to take less time than the problem solving items. However, there was still substantial variability within the QC and PS categories, with ostensibly quick QC items requiring an average of more than 100 seconds and 2 PS items taking less than a minute.

Even within a specific question type there was substantial variation. For example, three items classified as category 2 (QC, pure, algebra) averaged response times of over 100 seconds, while another three items in the same category averaged less than 40 seconds. Variation was especially wide for questions in categories 9 (PS, pure, arithmetic), 10 (PS, pure, algebra), and 11 (PS, pure, geometry). Of the 26 category 9 items, three required an average of over two minutes for examinees to answer, while another five took less than one minute. Similarly, examinees answered two category 10 questions in an average of less than a minute, while they needed an average of more than three minutes for another question in this category. And average response times for items in category 11 also ranged from under one minute to over three minutes.

The time differences within question types displayed in Table 1 might be explainable by differences in item difficulty. But Tables 2, 3, and 4--which show the relationship between item type and mean response time for easy (-1.5 ≤ b < -0.5), medium-difficulty (-0.5 < b ≤ 0.5), and difficult (0.5 < b ≤ 1.5) GRE quantitative items, respectively--indicate that even within a relatively narrow difficulty band, mean times for individual question types still varied over a broad range. However, the trend for more difficult items generally to take longer was also apparent. Only one out of the 59 easy items required more than two minutes, on average, to answer, but 4 out of the 55 difficult items required more than two minutes. Within question category 10 (PS, pure, algebra), four out of six difficult questions averaged more than two minutes, while all of the easy questions in this category averaged less than 100 seconds each to be solved.

Nevertheless, it is the within-category/within-difficulty-level variation that is most disturbing from a fairness perspective. For example, if the CAT item-selection algorithm called for a difficult, category 10 question, one examinee might get a question that, on average, students answer in less than 100 seconds, while another student could get a question that, on average, required more than 180 seconds to answer. Such time differences would not be of concern only if ample time were allowed to finish the test. Table 5 presents a comparison of the characteristics of one such pair of items. The items shown in Table 5 are from the same category, they are of nearly identical difficulty, and both were administered to relatively large samples of examinees. However, the items differ markedly in their mean solution times. Table 6 shows the same phenomenon for a different pair of items.

Questions that are of equal difficulty can have very different solution times, because the number of steps needed to solve a problem is not necessarily closely linked to difficulty. For example, Item C in Table 6 is a linear inequality that requires some processing time just to understand what is being asked. Item D, on the other hand, is a negative exponents question, the difficulty of which apparently stems from some examinees not knowing how to solve this type of problem; examinees who understand negative exponents can solve it quite quickly.

The potential fairness implications of these results can be clearly seen by imagining the difference in the testing experience of two hypothetical examinees based on a lucky or unlucky break in item selection. Suppose the two examinees, call them Mary and Jim, took identical tests, except that the questions administered to each in positions 5 and 8 were different. If Mary got the B and C pair while Jim got the A and D pair (and both students got the correct answer in the average amount of time), Jim would have almost three more minutes to complete the test than Mary.

A possible solution to this problem would be to include a category for solution time in the item-selection algorithm. This would ensure that no individual would get a disproportionate number of questions that require long or short response times. This, of course, leads to the question of the availability of solution times. For items that are pretested and calibrated during a CAT administration, solution times are available. However, a significant proportion of GRE test items that are used to create new item pools are calibrated in paper-and-pencil administrations, and so no solution times are available. For these items, expert ratings of estimated solution times could be obtained. To evaluate the likely success of this procedure, we asked three people with considerable experience in developing items for the GRE quantitative measure to rank order a set of seven items from shortest time to correct solution to longest time to correct solution. The seven items were all of the same category (category 10: PS, pure, algebra) and difficulty level (medium: -0.5 < b ≤ 0.5), and all had been answered correctly by at least 45 examinees. Mean and median times, along with the rater rankings, are shown in Table 7.

As Table 7 shows, median times are about 10 seconds shorter than mean times because of the previously noted positive skew of the time distributions, but both mean and median times tell the same story as to which items take the longest to solve. The rankings by the test development experts were reasonably close to the rank order of actual solution times. The item with the shortest actual solution time was ranked in the top three shortest time categories by all three test developers, and the item with the longest actual solution time was rated as the longest item by one rater and as the third longest item by the other two raters.
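Rank-order agreement of this kind can be summarized with a Spearman correlation between a rater's ranking and the observed solution times, as in the sketch below. The seven time values and the ranking are invented for illustration, and Python with scipy is assumed.

from scipy.stats import spearmanr

# Hypothetical data for seven items of the same category and difficulty band:
# observed median solution times (seconds) and one rater's ranking from
# shortest to longest expected solution time. Values are invented for illustration.
median_time = [68, 75, 83, 90, 104, 123, 141]
rater_rank = [2, 1, 3, 6, 4, 5, 7]

# Rank agreement between judged and observed solution times.
rho, p_value = spearmanr(median_time, rater_rank)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")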

However, there were also some misclassifications. For example, all three raters placed the actual fourth-place question in sixth place. Two out of the three raters thought that this item (which actually took 90 seconds, on average, to answer) would take longer to answer than the item that averaged 123 seconds. Thus, although useful and certainly better than no solution time estimates at all, ratings by test developers would not substitute for actual solution times.

Although examinees who happened to get an item that took much longer than average to answer would seem to be at a disadvantage, we saw no evidence of this here in terms of total GRE quantitative scores. Looking back at Tables 5 and 6, note that in both comparisons, mean quantitative scores were slightly higher for the students who answered the longer item in each pair--exactly the opposite of the lower scores that would be expected if getting the longer item were to hurt their chances to fully consider later items. We correlated the mean time needed to respond to each item with the mean score of the examinees who took that item, separately for the two main item types, QC and PS. Other things being equal, if having to take an item with a longer response time lowered scores, this correlation might be expected to be negative. However, both correlations were positive (.37 for the 134 QC items and .55 for the PS items). And of course, other things are not equal in a CAT. More difficult items, which tend to take longer, are administered disproportionately to higher ability students; hence the positive correlation. We attempted to compensate for this with a regression approach, predicting the mean score of the examinees taking the item from item difficulty (IRT parameter b), and then determining whether adding mean time made any incremental contribution to the prediction. Difficulty was indeed substantially correlated with mean score (.82 for QC and .88 for PS), but mean time did not make a significant incremental contribution to the prediction of mean scores (multiple R increased by less than .01).
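The sketch below illustrates this two-step regression--mean examinee score predicted first from item difficulty alone and then from difficulty plus mean response time--using simulated item-level data and the statsmodels library. It reproduces the logic of the check, not the reported coefficients.

import numpy as np
import statsmodels.api as sm

# Simulated item-level data standing in for the analyzed items: difficulty (b),
# mean response time, and mean GRE quantitative score of examinees routed to the item.
rng = np.random.default_rng(1)
n_items = 244
b = rng.normal(0, 1, n_items)
mean_time = 90 + 25 * b + rng.normal(0, 20, n_items)
mean_score = 550 + 60 * b + rng.normal(0, 25, n_items)   # in a CAT, difficulty drives mean score

# Step 1: predict mean score from difficulty alone.
X1 = sm.add_constant(b)
fit1 = sm.OLS(mean_score, X1).fit()

# Step 2: add mean response time and check the incremental contribution.
X2 = sm.add_constant(np.column_stack([b, mean_time]))
fit2 = sm.OLS(mean_score, X2).fit()

print("R-square, difficulty only:", round(fit1.rsquared, 3))
print("R-square, plus mean time:", round(fit2.rsquared, 3))
print("Increment:", round(fit2.rsquared - fit1.rsquared, 3))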

Gender differences. Tables 5 and 6 illustrate the potential for fairness problems from an individual perspective. Because the item selection algorithm would not systematically assign time-consuming questions to a particular gender group, there is less concern that these time differences would have an impact on fairness from a group perspective. Nevertheless, we attempted to determine whether particular items could be answered more quickly by one gender group than the other. Over the 252 items in the pool, we correlated the mean time to a correct answer for men with the mean time to a correct answer for women. The correlation was .92, suggesting that in general, the items that were most time consuming for men were also the items that were most time consuming for women. For the items that were answered correctly by at least 100 men and women, we computed the differences in mean times for the two gender groups. Only 8 items demonstrated differences of more than 20 seconds; men were faster for seven of these eight items. This result is not surprising, given the higher mean scores of men in this sample (for men, mean = 596 [SD = 29]; for women, mean = 524 [SD = 23]).

A closer look at the one item that women answered more quickly than men illustrates the difficulty in interpreting time differences on a multiple-choice examination. The item that women, on average, answered more quickly was a very difficult (b = 1.52), category 3 (QC, pure, geometry) problem. The mean time to a correct solution was 74 seconds for men (SD = 94; median = 48) and 4 seconds for women (SD = 92; median = 2). But only 19% of the 592 women who were administered this item answered it correctly, which is below the random guess rate of 25% for an item with four answer choices. (For men, the comparable figures were 105 out of 332, or 32%.) The mean GRE quantitative score of men who answered this item correctly was 76, while the mean score of men who answered it incorrectly was 66; for women, the comparable scores were 595 and 55, respectively. Thus, the women who got this item right were at about the same overall level as the men who got it wrong. One possible interpretation is that women were faster, on average, in this case because they gave up sooner and made a random guess. Relatively short times to a correct solution on a multiple-choice test may reflect either a high degree of mastery--or nonmastery with lucky guessing.

Study 2

Method

Data for the GRE analytical questions--the subject of Study 2--were obtained from a CAT item pool that was administered in 1998. For the analytical section, examinees had 60 minutes to answer 35 questions. The analytical section has questions of two types. One type, logical reasoning (LR), consists of discrete questions which test the ability to "understand, analyze, and evaluate arguments... [including] recognizing the point of an argument, recognizing assumptions on which an argument is based, drawing conclusions and formulating hypotheses, identifying methods of argument, evaluating arguments and counterarguments, and evaluating evidence" (Briel, O'Neill, & Scheuneman, 1993). Each item or group of logical reasoning items is based on a short argument or on a simple graph or table, generally an excerpt from the kind of material graduate students are likely to encounter in their academic and personal reading.

The second major category of items in the GRE analytical measure, analytical reasoning (AR), consists of item sets which test:

... the ability to understand a given structure of arbitrary relationships among fictitious persons, places, things or events, and to deduce information from the relationships given. Each analytical reasoning group consists of (1) a set of approximately three to seven related statements or conditions (and sometimes other explanatory material) describing the structure of relationships, and (2) three or more items that test understanding of that structure and its implications by requiring examinees to infer what is necessarily true or to determine what is or is not possible. (Briel, O'Neill, & Scheuneman, 1993)

A given examinee is administered nine discrete LR items, plus 26 AR items arranged in six sets (four four-item sets and two five-item sets). Each AR grouping consists of a problem stimulus with four to eight associated questions, but any individual examinee would see only four or five of these questions, and each would also likely see a different combination of questions. For example, one examinee might see only questions 1, 3, 5, 7, and 8 from a given group, while another examinee would see only questions 2, 3, 4, 6, and 7. Two examinees may see the same items based on the same stimulus, but they could receive them in a different order. In order to simplify the analyses, we decided to study in detail only the groups that were administered to the largest number of examinees. Five of these groups were associated with four-item AR sets, and the remaining five were associated with five-item AR sets.

Results

Position effects. We first evaluated the two five-item AR sets that were administered to each examinee. Typically, an examinee would receive one five-item AR set in positions 5-9 and the other five-item set in positions 25-29. Consider one problem statement (call it Stimulus A) with eight attached questions. Stimulus A might be used in positions 5-9 for some examinees and positions 25-29 for others. In either position, a given examinee would be administered only 5 of the 8 possible questions, so theoretically, 56 possible combinations of items could be generated from this one problem statement. However, in practice, a relatively small number of combinations accounted for all of the patterns actually administered. In positions 5-9, ,9 examinees received questions from Stimulus A. Four combinations of items (call them sets A1-A4) accounted for all but 365 of these examinees. In positions 25-29, only 832 examinees were administered questions from Stimulus A, but three of the same sets (A1, A2, and A4) accounted for all but 99 examinees. Set A3 was administered to 3 examinees in positions 5-9 but to only 4 examinees in positions 25-29.

Table 8 shows the mean test scores and solution times for sets A1-A4 in both positions. We included GRE quantitative score in the table because it is correlated with GRE analytical score (r = .68), but would itself be uninfluenced by performance on any of these sets. In all four sets, the average amount of time spent on the set was much shorter in the 25-29 position than in the 5-9 position. Consider set A4, which was seen by relatively large samples of examinees in both positions. The samples of students who were administered this set in the 5-9 and 25-29 positions were very comparable in terms of their mean GRE analytical and quantitative scores; yet the sample of examinees who took this set in the later position spent six minutes less to answer it than the sample who took it in the earlier position. This result could reflect a learning effect that would allow later sets to be answered more quickly, or it could reflect hurrying because time was running short near the end of the test. A learning effect would suggest that the number of items answered correctly in the set should be higher for those who were administered the set in the later position, while a hurrying effect would suggest that fewer items should be answered correctly. As indicated in the last column of Table 8, for A4--and indeed for every set--the number correct was lower when the set was administered in the later position, suggesting a substantial hurrying effect. A similar pattern was noted with all of the other five-item sets studied.

The lack of position effects for the quantitative items and strong position effects for the AR items could result from one or more differences between the two tests. Because the analytical measure is more speeded, greater position effects would be expected. In addition, the task requirements of AR items may produce greater time flexibility. AR sets require checking proposed solutions against a complex set of task requirements. If time were running short, some of these checks could be skipped. However, in a quantitative problem, it would be much more difficult to omit steps and still hope to get a correct answer; hence solution time would be relatively constant whether hurried or not.
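A minimal version of this early-versus-late comparison for a single AR set is sketched below, assuming Python with numpy and scipy and wholly simulated examinee records. A shorter mean set time together with fewer items correct in the later position is the pattern the report reads as a hurrying effect rather than a learning effect.

import numpy as np
from scipy.stats import ttest_ind

# Simulated examinee records for one five-item AR set seen either early (positions 5-9)
# or late (positions 25-29) in the test; all values are invented for illustration.
rng = np.random.default_rng(2)
time_early = rng.normal(600, 120, 400)       # seconds spent on the set, early position
time_late = rng.normal(420, 110, 350)        # same set taken near the end of the test
correct_early = rng.binomial(5, 0.62, 400)   # items correct out of five
correct_late = rng.binomial(5, 0.50, 350)

for label, early, late in [("set time (s)", time_early, time_late),
                           ("number correct", correct_early, correct_late)]:
    t, p = ttest_ind(early, late, equal_var=False)
    print(f"{label}: early mean = {early.mean():.1f}, late mean = {late.mean():.1f}, "
          f"t = {t:.1f}, p = {p:.3g}")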

Time differences for paired sets. We paired five-item AR sets that were administered in the same position (5-9), were of comparable difficulty (similar b values), and were administered to examinee groups of comparable ability (as measured by GRE quantitative score). We then compared the mean time examinees took to complete each pair of sets. Three sets from Group A were paired with three sets from Group E. Statistics for the three A-E pairs are presented in Table 9. The table shows the IRT b parameter for each question in each set, listed in order from lowest to highest. This is not necessarily the order in which the items were administered; there are many different administration orders possible for each set. For each pair, response times were shortest for the set from Group E.

The most closely matched pair was A4-E7, with mean b values within .05 points and mean test scores within 5 points (on the 200-800 GRE scale). Yet, on average, it took 78 seconds longer for examinees to respond to set A4 than to set E7. An extra 78 seconds to spend on the rest of the test could provide a significant advantage on an examination as speeded as the GRE analytical test. However, such an advantage was not evident in the current data, as mean analytical scores were not systematically higher for examinees who took items from Group E. Nevertheless, individuals who were unlucky enough to have to take several long sets could still be disadvantaged relative to individuals who had several short sets. (This issue is addressed more fully in a later section.)

In an ordinary, linear test, students of higher ability would generally be expected to be faster--especially when speed is seen as an inherent feature of the construct (as it is for AR items). A linear test is then often more speeded for lower ability students--that is, they have greater difficulty finishing in the time allowed. However, as Table 9 shows, in this testing situation higher ability students (on the bottom of the table) take substantially longer than the lower ability students (on the top of the table), because higher ability students are administered more difficult items. Thus, in a CAT, the usual relationship between ability and speededness can be totally reversed, so that the test is more speeded for higher ability students.

Similar analyses of the four-item AR sets provided further evidence that some sets can be answered much more quickly than others. The most closely matched pair of four-item sets, in terms of mean test scores, was administered in positions 6-9. Mean GRE quantitative scores for the 225 examinees who comprised one group of the pair were within two points of the mean for the 65 examinees who made up the other group (653 and 655, with standard deviations of 96 and 4), but the mean response time for one group was over two minutes longer than the mean response time of the other group (473 seconds vs. 597 seconds, with standard deviations of 45 and 53, respectively). The mean GRE analytical scores of these groups were very similar, 664 and 658 (SDs = 9 and 97) for the shorter response-time and longer response-time groups, respectively.

Time differences for LR items. For the LR items, which are not administered in sets, Table 10 shows the spread of mean latencies for questions with approximately the same difficulty level and position in the test. For example, the first line of the table is for very easy questions (b less than -1.5) that were administered in positions 1-4. Of the seven questions that met these conditions, average response times were 60-80 seconds for two items, 80-100 seconds for four items, and 100 to 120 seconds for one item. The next row of the table refers to questions administered in positions 10 or 15. Most of the items administered in one position were also administered in a different position, so that within a difficulty range, the same items may appear on more than one row in the table.
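A tabulation of the kind shown in Table 10 can be built by cross-classifying item-by-position means into difficulty levels, position blocks, and 20-second response-time categories, as in the sketch below. The data are simulated and the particular position blocks are illustrative rather than those used in the report; Python with pandas and numpy is assumed.

import numpy as np
import pandas as pd

# Simulated LR item-by-position means, to be binned the way Table 10 tabulates them.
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "b": rng.normal(0, 1, 200),
    "position": rng.integers(1, 36, 200),
    "mean_time": rng.normal(95, 30, 200).clip(30, 240),
})

# Difficulty bands from the Method section, illustrative position blocks, and 20-second time bins.
df["difficulty"] = pd.cut(df["b"], [-np.inf, -1.5, -0.5, 0.5, 1.5, np.inf],
                          labels=["very easy", "easy", "medium", "difficult", "very difficult"])
df["position_block"] = pd.cut(df["position"], [0, 4, 15, 29, 35],
                              labels=["1-4", "5-15", "16-29", "30-35"])
df["time_bin"] = pd.cut(df["mean_time"], np.arange(20, 260, 20))

counts = df.groupby(["difficulty", "position_block", "time_bin"], observed=True).size()
print(counts.head(12))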

Some items were not included in certain positions because they did not meet the minimum standard that means be based on at least 20 examinees. At each difficulty level, response times were notably shorter for positions 30 and 35, indicating that at this point examinees were hurrying to complete the test. For each position in the test, response times were longer for more difficult questions. For the easy and very easy questions in positions 1-4, only 5 out of 20 questions (25%) took over 100 seconds to complete, but for questions of medium difficulty or harder (b > -0.5), 27 out of 40 questions (68%) took over 100 seconds to complete. For the easy and very easy items, the range of mean response times in a given position was fairly narrow; at each position at both of these difficulty levels, two adjacent time categories were sufficient to describe almost all of the items. However, there was a considerably greater range in the mean times for the more difficult items--even within a given position in the test. For example, mean times for the 13 items with b values over 0.5 administered in positions 1-4 ranged over six 20-second categories. Two questions at this difficulty level had mean times greater than 140 seconds, and five items had mean times under 100 seconds. The same pattern of decreasing times with later positions in the test that was noted for the less difficult items was also found for these difficult items.

Impact of long tests on total scores. This range of mean times led again to the suspicion that examinees who had more items with long mean times might be at a disadvantage. To test this, we first determined the mean response time for each item. Because some items were administered more frequently later in the test, and because items taken later are answered more quickly, the mean time for an item was defined as the unweighted average of the times across positions in the test. Thus, the mean time for an item was independent of whether it happened to be administered primarily early or primarily late in the test. For the first item in each AR set, an adjustment for time to read the stimulus was created by comparing the time to answer an item when it was first in the set to the time for the same item when it occurred later in the set. Next, we computed the expected mean time for each individual's test by summing the mean times for each item taken by that individual. Our hypothesis was that, after controlling for general ability (using GRE quantitative score and GRE verbal score), GRE analytical scores would be lower for examinees with the longest expected times. To test this, we ran a regression with GRE analytical score as the dependent variable, entering GRE quantitative score and GRE verbal score at the first step, and expected time at the second step.
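The sketch below illustrates this expected-time analysis: each (simulated) examinee's expected time is the sum of the mean times of the items administered, and it is entered after GRE verbal and quantitative scores in a regression predicting the analytical score. Python with numpy and statsmodels is assumed, the confounding of expected time with ability is mimicked only crudely, and none of the printed values correspond to the study results.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n_items, n_examinees, test_len = 300, 2000, 35
item_mean_time = rng.normal(100, 25, n_items).clip(40, 220)   # unweighted across positions

ability = rng.normal(0, 1, n_examinees)
verbal = 500 + 90 * rng.normal(0, 1, n_examinees)
quant = 550 + 90 * ability + rng.normal(0, 60, n_examinees)
analytical = 550 + 80 * ability + rng.normal(0, 70, n_examinees)

# In a CAT, harder (longer) items go to abler examinees; mimic that crudely by adding an
# ability-related shift, then sum item mean times over each examinee's 35 items.
expected_time = np.array([
    item_mean_time[rng.choice(n_items, test_len, replace=False)].sum() + 300 * a
    for a in ability
])

# Hierarchical regression: verbal and quantitative scores first, expected time second.
X1 = sm.add_constant(np.column_stack([verbal, quant]))
X2 = sm.add_constant(np.column_stack([verbal, quant, expected_time]))
step1 = sm.OLS(analytical, X1).fit()
step2 = sm.OLS(analytical, X2).fit()
print("R-square step 1 (V, Q):", round(step1.rsquared, 3))
print("R-square step 2 (+ expected time):", round(step2.rsquared, 3))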

Our hypothesis was not supported. Indeed, the opposite was true. Expected time had a significant positive weight, and increased R-square from .52 to .66. Standardized weights were .3, .6, and .5 for GRE quantitative score, GRE verbal score, and expected time, respectively. Thus, examinees who took tests that should have taken longer got higher scores. Next, we looked to see if the expected negative relationship might emerge within a narrow ability range. We selected only the 862 examinees with GRE verbal plus GRE quantitative scores in the 1,030-1,060 range. Within this group, as expected, GRE verbal score and GRE quantitative score were no longer significant predictors of GRE analytical score, but expected time still had a substantial positive weight, increasing R-square from .00 to .36. Similar results were found for a low-scoring group (650-700 range) and a high-scoring group (1,300-1,350 range), and for analyses run separately for expected times on sets and on discrete items.

Apparently, these positive weights for expected time remained because of the relationship between item difficulty and expected time (r = .52)--that is, more difficult items take longer. With a CAT, more difficult items are administered to higher ability examinees, so higher ability examinees generally get tests that take longer. Adjusting for GRE verbal score and GRE quantitative score, which were correlated .7 with GRE analytical score (a substantial correlation, but still leaving half of the variance unexplained), was not enough to overcome this relationship of time to item difficulty and of item difficulty to test score.

In an attempt to compensate for the relationship of time to difficulty, we used the IRT b parameter to predict mean response time separately for LR and AR items. Each item then had a predicted time associated with it. We computed a time discrepancy score for each item as the difference between the time predicted from the item's difficulty and the actual mean time. The time discrepancy score for an individual was the sum of the time discrepancy scores for all of the items taken by that individual. A high time discrepancy score would then reflect a test that was especially long, taking difficulty into account. For both the LR and AR items, the time discrepancy score was entered after GRE quantitative score and GRE verbal score, with GRE analytical score as the dependent variable. The time discrepancy score had a small positive weight, which did not contribute significantly (p > .05) to the prediction, even with the sample of over 1,000 examinees. Thus, there was no evidence that examinees who got long tests, taking item difficulty into account, were disadvantaged in terms of their total scores.
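A minimal sketch of the time discrepancy score follows: item mean time is regressed on difficulty, a residual (actual minus predicted time) is kept for each item, and the residuals are summed over the items in an examinee's test. The data are simulated, only Python with numpy is assumed, and a single fit stands in for the separate LR and AR fits described above.

import numpy as np

rng = np.random.default_rng(5)
n_items = 300
b = rng.normal(0, 1, n_items)
mean_time = 95 + 20 * b + rng.normal(0, 15, n_items)

# Predict mean time from difficulty, then compute residuals (actual minus predicted).
slope, intercept = np.polyfit(b, mean_time, 1)
predicted_time = intercept + slope * b
discrepancy = mean_time - predicted_time

# Summed time discrepancy score for one hypothetical examinee's 35-item test:
# a high sum marks a test that ran long even after accounting for difficulty.
administered = rng.choice(n_items, 35, replace=False)
examinee_discrepancy = discrepancy[administered].sum()
print("Summed time discrepancy (s):", round(float(examinee_discrepancy), 1))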

For the above analyses, expected times were based on the full set of 35 items, and examinees with incomplete tests were excluded from the analysis. An additional analysis was run that was able to include all examinees by defining a long test only in terms of the first nine items (four LR items and the first five-item AR set). Mean item times were recomputed to reflect only performance on these items in positions 1-9, and as before, time discrepancy scores were defined as the difference between the time predicted from the item's difficulty and the actual mean time. The time discrepancy score for an individual was the sum of the time discrepancy scores for each of the items (1-9) taken by that individual. The distribution of these summed time discrepancy scores for individuals indicated a difference of about one minute between times at the 25th and 75th percentiles. This summed time discrepancy score had a small negative weight for predicting the number of questions attempted after GRE verbal score and GRE quantitative score were entered; though statistically significant in this very large sample of 4,745 examinees, the change in R was very small (.015 to .066, or a change in R-square from .000 to .004). The more important question was whether the summed time discrepancy score had a significant negative weight for predicting the GRE analytical score. It did not. It had a positive weight; although statistically significant, the change in R-square was infinitesimal (from .459 to .46). This analysis was repeated separately for male and female examinees, and for African American, Asian, Hispanic/Latino, and White subgroups. Results were remarkably consistent across these subgroups, with a small positive weight in each subgroup.

Conclusion

For both the GRE quantitative and analytical measures, some items clearly take longer to answer than others. Because estimated solution time is not included in the item selection algorithm, this difference in time could potentially create a fairness problem on a timed test. However, we could find no evidence of an impact on total scores for examinees who got unusually long questions or tests. Despite this lack of an overall effect on test scores, we cannot rule out the possibility that certain individuals could be negatively impacted by receiving items that take an unusually long time to answer.

Even without clear evidence of an impact on test scores, it might be desirable to include some measure of estimated solution time in the item selection algorithm so that no individual gets more than a fair share of time-consuming items. The feasibility of such an approach has already been demonstrated (van der Linden, Scrams, & Schnipke, 1999). However, the current results suggest that including such estimates may be more complicated than previously imagined, because an item, or a set of items, does not have a single response time, but rather

many response times, depending on the position in which it is administered in the test. This is especially true for AR items. Suppose the time needed for a five-question AR set were estimated from a pretest that administered the set near the end of the test, but the set was actually administered to some examinees near the beginning of the test. The estimated time could be in error by more than six minutes. Knowledge of position effects might lead to appropriate adjustments in estimated time depending on position. In addition to clarifying these position effects, future research could investigate possible relationships between individual differences in pacing styles (of the type described by Scrams and Schnipke, 1999) and the time demands created by particular sets of items.
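As a rough illustration of how an estimated-solution-time constraint might enter item selection--in the spirit of the response-time-constrained approach cited in the conclusion, though not that method itself--the sketch below greedily selects items by information while keeping the projected total time within a fixed budget. The pool, the information values, the time estimates, and the selection rule are all invented for illustration.

import numpy as np

# Toy item pool: information at the current ability estimate and an estimated mean
# solution time per item (both invented).
rng = np.random.default_rng(6)
info = rng.uniform(0.2, 1.0, 200)
est_time = rng.normal(100, 30, 200).clip(40, 220)

test_length, time_budget = 28, 28 * 96        # about 96 seconds per item, as in the quantitative section
selected, running_time = [], 0.0

for _ in range(test_length):
    remaining = test_length - len(selected)
    # Consider only items that keep the projected total within budget,
    # reserving a minimal 60 seconds for each item still to be selected.
    feasible = [i for i in range(len(info))
                if i not in selected
                and running_time + est_time[i] <= time_budget - 60 * (remaining - 1)]
    candidates = feasible if feasible else [i for i in range(len(info)) if i not in selected]
    best = max(candidates, key=lambda i: info[i])
    selected.append(best)
    running_time += est_time[best]

print("items selected:", len(selected), "projected time (s):", round(running_time))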

References

Bejar, I. I. (1985). Test speededness under number-right scoring: An analysis of the Test of English as a Foreign Language (Research Report RR-85). Princeton, NJ: Educational Testing Service.

Briel, J. B., O'Neill, K. A., & Scheuneman, J. D. (1993). GRE technical manual. Princeton, NJ: Educational Testing Service.

Bridgeman, B., Anderson, D., & Wightman, L. (1997). Overview of the GMAT CAT pilot test. Unpublished report.

Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer-Nijhoff.

Schnipke, D. L., & Scrams, D. J. (1997). Modeling item response times with a two-state mixture model: A new method of measuring speededness. Journal of Educational Measurement, 34.

Scrams, D. J., & Schnipke, D. L. (1999). Response-time feedback on computer-administered tests. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal.

van der Linden, W. J., Scrams, D. J., & Schnipke, D. L. (1999). Using response-time constraints to control for differential speededness in computerized adaptive testing. Applied Psychological Measurement, 23, 195-210.

Yamamoto, K. (1995). Estimating the effects of test length and test time on parameter estimates using the HYBRID model (TOEFL Technical Report No. TR-10). Princeton, NJ: Educational Testing Service.

Table 1. Number of GRE Quantitative Items in Response-Time Categories by Item Type

(Rows are the 16 item types listed in the Method section; columns are mean response times in 20-second intervals, with the highest category being more than 180 seconds, plus a total column. [Cell counts not recoverable in this copy.])

Note. QC = quantitative comparison; PS = problem solving; P = pure (numbers only); R = real; DI = data interpretation.

Table 2. Number of Easy (-1.5 ≤ b < -0.5) GRE Quantitative Items in Response-Time Categories by Item Type

(Same layout as Table 1: the 16 item types by mean response time in 20-second intervals. [Cell counts not recoverable in this copy.])

Note. Item type: QC = quantitative comparison; PS = problem solving; P = pure; R = real; DI = data interpretation.

Table 3. Number of Medium-Difficulty (-0.5 < b ≤ 0.5) GRE Quantitative Items in Response-Time Categories by Item Type

(Same layout as Table 1. [Cell counts not recoverable in this copy.])

Note. Item type: QC = quantitative comparison; PS = problem solving; P = pure; R = real; DI = data interpretation.

Table 4. Number of Difficult (0.5 < b ≤ 1.5) GRE Quantitative Items in Response-Time Categories by Item Type

(Same layout as Table 1. [Cell counts not recoverable in this copy.])

Note. Item type: QC = quantitative comparison; PS = problem solving; P = pure; R = real; DI = data interpretation.

Table 5. Comparison of Two Category-10 Items

(Characteristics compared for Item A and Item B: difficulty (IRT b parameter); number answering correctly; number answering incorrectly; mean GRE quantitative score for examinees answering correctly; mean GRE quantitative score for examinees answering incorrectly; mean time, in seconds, to a correct answer; mean time, in seconds, to a wrong answer; mean time, correct, position 5 (n = 68 for A and 34 for B); and mean time, wrong, position 5 (n = 6 for A and 89 for B). [Cell values not recoverable in this copy.])

Note. Category 10: PS, pure, algebra.


Sheila Barron Statistics Outreach Center 2/8/2011 Sheila Barron Statistics Outreach Center 2/8/2011 What is Power? When conducting a research study using a statistical hypothesis test, power is the probability of getting statistical significance when

More information

The Classification Accuracy of Measurement Decision Theory. Lawrence Rudner University of Maryland

The Classification Accuracy of Measurement Decision Theory. Lawrence Rudner University of Maryland Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, April 23-25, 2003 The Classification Accuracy of Measurement Decision Theory Lawrence Rudner University

More information

Computerized Mastery Testing

Computerized Mastery Testing Computerized Mastery Testing With Nonequivalent Testlets Kathleen Sheehan and Charles Lewis Educational Testing Service A procedure for determining the effect of testlet nonequivalence on the operating

More information

Measuring the User Experience

Measuring the User Experience Measuring the User Experience Collecting, Analyzing, and Presenting Usability Metrics Chapter 2 Background Tom Tullis and Bill Albert Morgan Kaufmann, 2008 ISBN 978-0123735584 Introduction Purpose Provide

More information

A Broad-Range Tailored Test of Verbal Ability

A Broad-Range Tailored Test of Verbal Ability A Broad-Range Tailored Test of Verbal Ability Frederic M. Lord Educational Testing Service Two parallel forms of a broad-range tailored test of verbal ability have been built. The test is appropriate from

More information

2.75: 84% 2.5: 80% 2.25: 78% 2: 74% 1.75: 70% 1.5: 66% 1.25: 64% 1.0: 60% 0.5: 50% 0.25: 25% 0: 0%

2.75: 84% 2.5: 80% 2.25: 78% 2: 74% 1.75: 70% 1.5: 66% 1.25: 64% 1.0: 60% 0.5: 50% 0.25: 25% 0: 0% Capstone Test (will consist of FOUR quizzes and the FINAL test grade will be an average of the four quizzes). Capstone #1: Review of Chapters 1-3 Capstone #2: Review of Chapter 4 Capstone #3: Review of

More information

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida and Oleksandr S. Chernyshenko University of Canterbury Presented at the New CAT Models

More information

Appendix B Statistical Methods

Appendix B Statistical Methods Appendix B Statistical Methods Figure B. Graphing data. (a) The raw data are tallied into a frequency distribution. (b) The same data are portrayed in a bar graph called a histogram. (c) A frequency polygon

More information

EVALUATING AND IMPROVING MULTIPLE CHOICE QUESTIONS

EVALUATING AND IMPROVING MULTIPLE CHOICE QUESTIONS DePaul University INTRODUCTION TO ITEM ANALYSIS: EVALUATING AND IMPROVING MULTIPLE CHOICE QUESTIONS Ivan Hernandez, PhD OVERVIEW What is Item Analysis? Overview Benefits of Item Analysis Applications Main

More information

Chapter 7: Descriptive Statistics

Chapter 7: Descriptive Statistics Chapter Overview Chapter 7 provides an introduction to basic strategies for describing groups statistically. Statistical concepts around normal distributions are discussed. The statistical procedures of

More information

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups GMAC Scaling Item Difficulty Estimates from Nonequivalent Groups Fanmin Guo, Lawrence Rudner, and Eileen Talento-Miller GMAC Research Reports RR-09-03 April 3, 2009 Abstract By placing item statistics

More information

3 CONCEPTUAL FOUNDATIONS OF STATISTICS

3 CONCEPTUAL FOUNDATIONS OF STATISTICS 3 CONCEPTUAL FOUNDATIONS OF STATISTICS In this chapter, we examine the conceptual foundations of statistics. The goal is to give you an appreciation and conceptual understanding of some basic statistical

More information

Lesson 1: Distributions and Their Shapes

Lesson 1: Distributions and Their Shapes Lesson 1 Name Date Lesson 1: Distributions and Their Shapes 1. Sam said that a typical flight delay for the sixty BigAir flights was approximately one hour. Do you agree? Why or why not? 2. Sam said that

More information

Statistical analysis DIANA SAPLACAN 2017 * SLIDES ADAPTED BASED ON LECTURE NOTES BY ALMA LEORA CULEN

Statistical analysis DIANA SAPLACAN 2017 * SLIDES ADAPTED BASED ON LECTURE NOTES BY ALMA LEORA CULEN Statistical analysis DIANA SAPLACAN 2017 * SLIDES ADAPTED BASED ON LECTURE NOTES BY ALMA LEORA CULEN Vs. 2 Background 3 There are different types of research methods to study behaviour: Descriptive: observations,

More information

Recent developments for combining evidence within evidence streams: bias-adjusted meta-analysis

Recent developments for combining evidence within evidence streams: bias-adjusted meta-analysis EFSA/EBTC Colloquium, 25 October 2017 Recent developments for combining evidence within evidence streams: bias-adjusted meta-analysis Julian Higgins University of Bristol 1 Introduction to concepts Standard

More information

Learning with Rare Cases and Small Disjuncts

Learning with Rare Cases and Small Disjuncts Appears in Proceedings of the 12 th International Conference on Machine Learning, Morgan Kaufmann, 1995, 558-565. Learning with Rare Cases and Small Disjuncts Gary M. Weiss Rutgers University/AT&T Bell

More information

Political Science 15, Winter 2014 Final Review

Political Science 15, Winter 2014 Final Review Political Science 15, Winter 2014 Final Review The major topics covered in class are listed below. You should also take a look at the readings listed on the class website. Studying Politics Scientifically

More information

Likelihood Ratio Based Computerized Classification Testing. Nathan A. Thompson. Assessment Systems Corporation & University of Cincinnati.

Likelihood Ratio Based Computerized Classification Testing. Nathan A. Thompson. Assessment Systems Corporation & University of Cincinnati. Likelihood Ratio Based Computerized Classification Testing Nathan A. Thompson Assessment Systems Corporation & University of Cincinnati Shungwon Ro Kenexa Abstract An efficient method for making decisions

More information

Implicit Information in Directionality of Verbal Probability Expressions

Implicit Information in Directionality of Verbal Probability Expressions Implicit Information in Directionality of Verbal Probability Expressions Hidehito Honda (hito@ky.hum.titech.ac.jp) Kimihiko Yamagishi (kimihiko@ky.hum.titech.ac.jp) Graduate School of Decision Science

More information

Influences of IRT Item Attributes on Angoff Rater Judgments

Influences of IRT Item Attributes on Angoff Rater Judgments Influences of IRT Item Attributes on Angoff Rater Judgments Christian Jones, M.A. CPS Human Resource Services Greg Hurt!, Ph.D. CSUS, Sacramento Angoff Method Assemble a panel of subject matter experts

More information

The role of sampling assumptions in generalization with multiple categories

The role of sampling assumptions in generalization with multiple categories The role of sampling assumptions in generalization with multiple categories Wai Keen Vong (waikeen.vong@adelaide.edu.au) Andrew T. Hendrickson (drew.hendrickson@adelaide.edu.au) Amy Perfors (amy.perfors@adelaide.edu.au)

More information

CHAPTER - 6 STATISTICAL ANALYSIS. This chapter discusses inferential statistics, which use sample data to

CHAPTER - 6 STATISTICAL ANALYSIS. This chapter discusses inferential statistics, which use sample data to CHAPTER - 6 STATISTICAL ANALYSIS 6.1 Introduction This chapter discusses inferential statistics, which use sample data to make decisions or inferences about population. Populations are group of interest

More information

Reliability, validity, and all that jazz

Reliability, validity, and all that jazz Reliability, validity, and all that jazz Dylan Wiliam King s College London Published in Education 3-13, 29 (3) pp. 17-21 (2001) Introduction No measuring instrument is perfect. If we use a thermometer

More information

Regression Discontinuity Analysis

Regression Discontinuity Analysis Regression Discontinuity Analysis A researcher wants to determine whether tutoring underachieving middle school students improves their math grades. Another wonders whether providing financial aid to low-income

More information

GRE R E S E A R C H. Cognitive Patterns of Gender Differences on Mathematics Admissions Tests. Ann Gallagher Jutta Levin Cara Cahalan.

GRE R E S E A R C H. Cognitive Patterns of Gender Differences on Mathematics Admissions Tests. Ann Gallagher Jutta Levin Cara Cahalan. GRE R E S E A R C H Cognitive Patterns of Gender Differences on Mathematics Admissions Tests Ann Gallagher Jutta Levin Cara Cahalan September 2002 GRE Board Professional Report No. 96-17P ETS Research

More information

Medical Statistics 1. Basic Concepts Farhad Pishgar. Defining the data. Alive after 6 months?

Medical Statistics 1. Basic Concepts Farhad Pishgar. Defining the data. Alive after 6 months? Medical Statistics 1 Basic Concepts Farhad Pishgar Defining the data Population and samples Except when a full census is taken, we collect data on a sample from a much larger group called the population.

More information

Analogical Inference

Analogical Inference Analogical Inference An Investigation of the Functioning of the Hippocampus in Relational Learning Using fmri William Gross Anthony Greene Today I am going to talk to you about a new task we designed to

More information

Test item response time and the response likelihood

Test item response time and the response likelihood Test item response time and the response likelihood Srdjan Verbić 1 & Boris Tomić Institute for Education Quality and Evaluation Test takers do not give equally reliable responses. They take different

More information

Examining the Psychometric Properties of The McQuaig Occupational Test

Examining the Psychometric Properties of The McQuaig Occupational Test Examining the Psychometric Properties of The McQuaig Occupational Test Prepared for: The McQuaig Institute of Executive Development Ltd., Toronto, Canada Prepared by: Henryk Krajewski, Ph.D., Senior Consultant,

More information

Business Statistics Probability

Business Statistics Probability Business Statistics The following was provided by Dr. Suzanne Delaney, and is a comprehensive review of Business Statistics. The workshop instructor will provide relevant examples during the Skills Assessment

More information

Improving Individual and Team Decisions Using Iconic Abstractions of Subjective Knowledge

Improving Individual and Team Decisions Using Iconic Abstractions of Subjective Knowledge 2004 Command and Control Research and Technology Symposium Improving Individual and Team Decisions Using Iconic Abstractions of Subjective Knowledge Robert A. Fleming SPAWAR Systems Center Code 24402 53560

More information

The Effect of Guessing on Item Reliability

The Effect of Guessing on Item Reliability The Effect of Guessing on Item Reliability under Answer-Until-Correct Scoring Michael Kane National League for Nursing, Inc. James Moloney State University of New York at Brockport The answer-until-correct

More information

Statistical Techniques. Masoud Mansoury and Anas Abulfaraj

Statistical Techniques. Masoud Mansoury and Anas Abulfaraj Statistical Techniques Masoud Mansoury and Anas Abulfaraj What is Statistics? https://www.youtube.com/watch?v=lmmzj7599pw The definition of Statistics The practice or science of collecting and analyzing

More information

Chapter 3: Examining Relationships

Chapter 3: Examining Relationships Name Date Per Key Vocabulary: response variable explanatory variable independent variable dependent variable scatterplot positive association negative association linear correlation r-value regression

More information

Chapter 2 Norms and Basic Statistics for Testing MULTIPLE CHOICE

Chapter 2 Norms and Basic Statistics for Testing MULTIPLE CHOICE Chapter 2 Norms and Basic Statistics for Testing MULTIPLE CHOICE 1. When you assert that it is improbable that the mean intelligence test score of a particular group is 100, you are using. a. descriptive

More information

Section 3.2 Least-Squares Regression

Section 3.2 Least-Squares Regression Section 3.2 Least-Squares Regression Linear relationships between two quantitative variables are pretty common and easy to understand. Correlation measures the direction and strength of these relationships.

More information

DO NOT OPEN THIS BOOKLET UNTIL YOU ARE TOLD TO DO SO

DO NOT OPEN THIS BOOKLET UNTIL YOU ARE TOLD TO DO SO NATS 1500 Mid-term test A1 Page 1 of 8 Name (PRINT) Student Number Signature Instructions: York University DIVISION OF NATURAL SCIENCE NATS 1500 3.0 Statistics and Reasoning in Modern Society Mid-Term

More information

SUPPLEMENTAL MATERIAL

SUPPLEMENTAL MATERIAL 1 SUPPLEMENTAL MATERIAL Response time and signal detection time distributions SM Fig. 1. Correct response time (thick solid green curve) and error response time densities (dashed red curve), averaged across

More information

Reliability, validity, and all that jazz

Reliability, validity, and all that jazz Reliability, validity, and all that jazz Dylan Wiliam King s College London Introduction No measuring instrument is perfect. The most obvious problems relate to reliability. If we use a thermometer to

More information

A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model

A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model Gary Skaggs Fairfax County, Virginia Public Schools José Stevenson

More information

Midterm STAT-UB.0003 Regression and Forecasting Models. I will not lie, cheat or steal to gain an academic advantage, or tolerate those who do.

Midterm STAT-UB.0003 Regression and Forecasting Models. I will not lie, cheat or steal to gain an academic advantage, or tolerate those who do. Midterm STAT-UB.0003 Regression and Forecasting Models The exam is closed book and notes, with the following exception: you are allowed to bring one letter-sized page of notes into the exam (front and

More information

Using response time data to inform the coding of omitted responses

Using response time data to inform the coding of omitted responses Psychological Test and Assessment Modeling, Volume 58, 2016 (4), 671-701 Using response time data to inform the coding of omitted responses Jonathan P. Weeks 1, Matthias von Davier & Kentaro Yamamoto Abstract

More information

Addendum: Multiple Regression Analysis (DRAFT 8/2/07)

Addendum: Multiple Regression Analysis (DRAFT 8/2/07) Addendum: Multiple Regression Analysis (DRAFT 8/2/07) When conducting a rapid ethnographic assessment, program staff may: Want to assess the relative degree to which a number of possible predictive variables

More information

A Guide to Clinical Interpretation of the Test of Variables of Attention (T.O.V.A. TM )

A Guide to Clinical Interpretation of the Test of Variables of Attention (T.O.V.A. TM ) A Guide to Clinical Interpretation of the Test of Variables of Attention (T.O.V.A. TM ) Steven J. Hughes, PhD, LP, ABPdN Director of Education and Research The TOVA Company 2008 The TOVA Company Purpose:

More information

Correlational Research. Correlational Research. Stephen E. Brock, Ph.D., NCSP EDS 250. Descriptive Research 1. Correlational Research: Scatter Plots

Correlational Research. Correlational Research. Stephen E. Brock, Ph.D., NCSP EDS 250. Descriptive Research 1. Correlational Research: Scatter Plots Correlational Research Stephen E. Brock, Ph.D., NCSP California State University, Sacramento 1 Correlational Research A quantitative methodology used to determine whether, and to what degree, a relationship

More information

USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION

USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION Iweka Fidelis (Ph.D) Department of Educational Psychology, Guidance and Counselling, University of Port Harcourt,

More information

Statistics. Nur Hidayanto PSP English Education Dept. SStatistics/Nur Hidayanto PSP/PBI

Statistics. Nur Hidayanto PSP English Education Dept. SStatistics/Nur Hidayanto PSP/PBI Statistics Nur Hidayanto PSP English Education Dept. RESEARCH STATISTICS WHAT S THE RELATIONSHIP? RESEARCH RESEARCH positivistic Prepositivistic Postpositivistic Data Initial Observation (research Question)

More information

Research Methods 1 Handouts, Graham Hole,COGS - version 1.0, September 2000: Page 1:

Research Methods 1 Handouts, Graham Hole,COGS - version 1.0, September 2000: Page 1: Research Methods 1 Handouts, Graham Hole,COGS - version 10, September 000: Page 1: T-TESTS: When to use a t-test: The simplest experimental design is to have two conditions: an "experimental" condition

More information

Absolute Identification is Surprisingly Faster with More Closely Spaced Stimuli

Absolute Identification is Surprisingly Faster with More Closely Spaced Stimuli Absolute Identification is Surprisingly Faster with More Closely Spaced Stimuli James S. Adelman (J.S.Adelman@warwick.ac.uk) Neil Stewart (Neil.Stewart@warwick.ac.uk) Department of Psychology, University

More information

SLEEP DISTURBANCE ABOUT SLEEP DISTURBANCE INTRODUCTION TO ASSESSMENT OPTIONS. 6/27/2018 PROMIS Sleep Disturbance Page 1

SLEEP DISTURBANCE ABOUT SLEEP DISTURBANCE INTRODUCTION TO ASSESSMENT OPTIONS. 6/27/2018 PROMIS Sleep Disturbance Page 1 SLEEP DISTURBANCE A brief guide to the PROMIS Sleep Disturbance instruments: ADULT PROMIS Item Bank v1.0 Sleep Disturbance PROMIS Short Form v1.0 Sleep Disturbance 4a PROMIS Short Form v1.0 Sleep Disturbance

More information

USE AND MISUSE OF MIXED MODEL ANALYSIS VARIANCE IN ECOLOGICAL STUDIES1

USE AND MISUSE OF MIXED MODEL ANALYSIS VARIANCE IN ECOLOGICAL STUDIES1 Ecology, 75(3), 1994, pp. 717-722 c) 1994 by the Ecological Society of America USE AND MISUSE OF MIXED MODEL ANALYSIS VARIANCE IN ECOLOGICAL STUDIES1 OF CYNTHIA C. BENNINGTON Department of Biology, West

More information

A Comparison of Three Measures of the Association Between a Feature and a Concept

A Comparison of Three Measures of the Association Between a Feature and a Concept A Comparison of Three Measures of the Association Between a Feature and a Concept Matthew D. Zeigenfuse (mzeigenf@msu.edu) Department of Psychology, Michigan State University East Lansing, MI 48823 USA

More information

WDHS Curriculum Map Probability and Statistics. What is Statistics and how does it relate to you?

WDHS Curriculum Map Probability and Statistics. What is Statistics and how does it relate to you? WDHS Curriculum Map Probability and Statistics Time Interval/ Unit 1: Introduction to Statistics 1.1-1.3 2 weeks S-IC-1: Understand statistics as a process for making inferences about population parameters

More information

Examining differences between two sets of scores

Examining differences between two sets of scores 6 Examining differences between two sets of scores In this chapter you will learn about tests which tell us if there is a statistically significant difference between two sets of scores. In so doing you

More information

CHAPTER VI RESEARCH METHODOLOGY

CHAPTER VI RESEARCH METHODOLOGY CHAPTER VI RESEARCH METHODOLOGY 6.1 Research Design Research is an organized, systematic, data based, critical, objective, scientific inquiry or investigation into a specific problem, undertaken with the

More information

Chapter 2--Norms and Basic Statistics for Testing

Chapter 2--Norms and Basic Statistics for Testing Chapter 2--Norms and Basic Statistics for Testing Student: 1. Statistical procedures that summarize and describe a series of observations are called A. inferential statistics. B. descriptive statistics.

More information

Statistical Techniques. Meta-Stat provides a wealth of statistical tools to help you examine your data. Overview

Statistical Techniques. Meta-Stat provides a wealth of statistical tools to help you examine your data. Overview 7 Applying Statistical Techniques Meta-Stat provides a wealth of statistical tools to help you examine your data. Overview... 137 Common Functions... 141 Selecting Variables to be Analyzed... 141 Deselecting

More information

linking in educational measurement: Taking differential motivation into account 1

linking in educational measurement: Taking differential motivation into account 1 Selecting a data collection design for linking in educational measurement: Taking differential motivation into account 1 Abstract In educational measurement, multiple test forms are often constructed to

More information

Part 1. For each of the following questions fill-in the blanks. Each question is worth 2 points.

Part 1. For each of the following questions fill-in the blanks. Each question is worth 2 points. Part 1. For each of the following questions fill-in the blanks. Each question is worth 2 points. 1. The bell-shaped frequency curve is so common that if a population has this shape, the measurements are

More information

Item Analysis Explanation

Item Analysis Explanation Item Analysis Explanation The item difficulty is the percentage of candidates who answered the question correctly. The recommended range for item difficulty set forth by CASTLE Worldwide, Inc., is between

More information

A model of parallel time estimation

A model of parallel time estimation A model of parallel time estimation Hedderik van Rijn 1 and Niels Taatgen 1,2 1 Department of Artificial Intelligence, University of Groningen Grote Kruisstraat 2/1, 9712 TS Groningen 2 Department of Psychology,

More information

12.1 Inference for Linear Regression. Introduction

12.1 Inference for Linear Regression. Introduction 12.1 Inference for Linear Regression vocab examples Introduction Many people believe that students learn better if they sit closer to the front of the classroom. Does sitting closer cause higher achievement,

More information

Relationships Between the High Impact Indicators and Other Indicators

Relationships Between the High Impact Indicators and Other Indicators Relationships Between the High Impact Indicators and Other Indicators The High Impact Indicators are a list of key skills assessed on the GED test that, if emphasized in instruction, can help instructors

More information

Chapter 1: Introduction to Statistics

Chapter 1: Introduction to Statistics Chapter 1: Introduction to Statistics Variables A variable is a characteristic or condition that can change or take on different values. Most research begins with a general question about the relationship

More information

Running head: How large denominators are leading to large errors 1

Running head: How large denominators are leading to large errors 1 Running head: How large denominators are leading to large errors 1 How large denominators are leading to large errors Nathan Thomas Kent State University How large denominators are leading to large errors

More information

PLANNING THE RESEARCH PROJECT

PLANNING THE RESEARCH PROJECT Van Der Velde / Guide to Business Research Methods First Proof 6.11.2003 4:53pm page 1 Part I PLANNING THE RESEARCH PROJECT Van Der Velde / Guide to Business Research Methods First Proof 6.11.2003 4:53pm

More information

Chapter 1: Exploring Data

Chapter 1: Exploring Data Chapter 1: Exploring Data Key Vocabulary:! individual! variable! frequency table! relative frequency table! distribution! pie chart! bar graph! two-way table! marginal distributions! conditional distributions!

More information

COMPUTING READER AGREEMENT FOR THE GRE

COMPUTING READER AGREEMENT FOR THE GRE RM-00-8 R E S E A R C H M E M O R A N D U M COMPUTING READER AGREEMENT FOR THE GRE WRITING ASSESSMENT Donald E. Powers Princeton, New Jersey 08541 October 2000 Computing Reader Agreement for the GRE Writing

More information

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form INVESTIGATING FIT WITH THE RASCH MODEL Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form of multidimensionality. The settings in which measurement

More information

Psychology Research Process

Psychology Research Process Psychology Research Process Logical Processes Induction Observation/Association/Using Correlation Trying to assess, through observation of a large group/sample, what is associated with what? Examples:

More information

Computerized Adaptive Testing for Classifying Examinees Into Three Categories

Computerized Adaptive Testing for Classifying Examinees Into Three Categories Measurement and Research Department Reports 96-3 Computerized Adaptive Testing for Classifying Examinees Into Three Categories T.J.H.M. Eggen G.J.J.M. Straetmans Measurement and Research Department Reports

More information