
Are the Least Frequently Chosen Distractors the Least Attractive?: The Case of a Four-Option Picture-Description Listening Test

Hideki IIMURA
Prefectural University of Kumamoto

Abstract
This study reports on a replication of Iimura's (2014) study on the attractiveness of distractors in multiple-choice listening tests. Sixty-eight Japanese university students were assessed on their correct responses in a picture-description task of an MC listening test (15 questions, four options each). During the listening test, the participants were asked to judge each of the four options as correct or incorrect and to report the degree of confidence they had in their judgment. On the basis of the confidence level and the correctness of the response, confidence and attractiveness scores were generated. To assess how listening ability affected test-takers' confidence and the distractors' attractiveness, three groups were formed on the basis of the scores obtained on the listening test. The results of the replication confirmed those of the original study, suggesting that (a) the least frequently chosen distractors were not always the least attractive, (b) upper-level listeners were less attracted to distractors, and (c) upper-level listeners had greater confidence when responding with the correct answers. This article concludes that the conventional item analyses (i.e., response frequency and discriminatory power) are insufficient for evaluating the effectiveness of distractors, and that a new kind of survey, in which test-takers can evaluate each distractor independently, should be incorporated into future MC test development.

1. Introduction
A typical multiple-choice (MC) item consists of three or four options. One, often known as the key, is correct, while the other options, the distractors, are incorrect answers. Downing (2006)

explained that the MC format is favored in large-scale achievement tests because it has significant validity advantages. Beyond its efficiency for automated scoring, the MC format, if properly constructed, can assess a wide range of content and measure high-level cognitive skills. From a listening assessment perspective, the MC format can tap into various levels of processing the input (e.g., the phonological, word, sentence, pragmatic, or discourse level; Field, 2012). However, the MC format has often been criticized for its limitations, such as inducing random guessing or creating a negative washback effect (Hughes, 2003). From a listening assessment perspective, it imposes heavy cognitive demands because of dual-task interference between audition (listening to the text) and vision (reading the questions and options) (Pashler & Johnson, 1998). Furthermore, valid MC items, especially effective distractors, are very difficult to construct. In addition, creating a sufficient number of plausible distractors is extremely difficult (Douglas, 2010). In fact, in his review of MC-related articles, Rodriguez (2005) found that only one or two distractors among four or five options functioned as intended. Evidently, then, one vital factor in successful MC tests is creating functional distractors.

As many theorists have contended, the functionality of a distractor has conventionally been evaluated in two ways: (a) discriminability and (b) response frequency (Fulcher, 2010; Haladyna, 2004; Haladyna & Downing, 1993; Henning, 1987). Discriminability measures the extent to which a distractor can distinguish high-ability from low-ability test-takers. Response frequency measures the number of test-takers who choose each distractor. In other words, dysfunctional distractors have no discriminatory power and prove unattractive to test-takers. In test development, distractors that succeed in attracting a certain number of test-takers are judged as functional and therefore remain in the test, whereas distractors that attract few test-takers are judged as unattractive and become targets for modification.

Previous studies have indicated that reducing the number of MC options has little effect on MC test results. For example, reducing the number of options from four to three, by eliminating the least frequently chosen option, did not affect item difficulty or item discrimination (Shizuka, Takeuchi,

Yashima, & Yoshizawa, 2006). Similar results were found when comparing five-, four-, and three-option listening tests (Lee & Winke, 2013). The three-option MC test differed in item difficulty from the four- and five-option tests, but the three formats did not differ in item discrimination. Although the previous studies employed different option-deletion methods, namely item analysis (Shizuka et al., 2006) and evaluators' judgments (Lee & Winke, 2013), both studies depended on the same assumption that the least frequently selected distractors cannot attract the attention of test-takers.

Considering the nature of the MC test-taking process, further consideration of the relationship between response frequency and the attractiveness of distractors is necessary. It seems reasonable to assume that, when taking an MC test, test-takers make three or four judgments per item, depending on the number of options, before they finally choose one. That is, given that test-takers can select only one option, some unselected distractors may nevertheless have functioned as plausible ones. Thus, we can propose that unselected distractors are not uniformly unattractive to test-takers. Therefore, we should evaluate each distractor's attractiveness separately from its response frequency.

2. The Original Study
To evaluate distractor attractiveness, Iimura (2014) developed a questionnaire in which the participants were asked to judge each of the options as correct or incorrect and to indicate the degree of confidence in their judgment. He surveyed 75 Japanese university students with the questionnaire to explore distractor attractiveness from the following perspectives:
1. Compare keys with distractors with regard to test-takers' confidence levels;
2. Compare the attractiveness of distractors in terms of test-takers' confidence levels and response frequency (p. 21).
With regard to the first perspective, the original study found that differences in English proficiency affected test-takers' confidence in choosing the correct and incorrect answers. The results showed a significant difference between test-takers' proficiency levels in the degree of confidence in both keys and distractors, indicating that proficient listeners were more

likely to choose the correct answer with more confidence and that less proficient listeners were more likely to be distracted by the distractors. With regard to the second perspective, the study revealed that response frequency was not always consistent with the attractiveness of distractors. In more than half of the items, the least frequently chosen distractor was not the least attractive. Based on the survey results, he concluded that conventional item analysis based on response frequency was not sufficient for evaluating the effectiveness of distractors.

3. The Present Study
3.1 Aims
Although Iimura (2014) revealed the insufficiency of conventional distractor analysis, further investigation is needed to verify his findings because his study examined only a three-option question-response task. Given that task types can affect test performance (Bachman & Palmer, 2010), it is necessary to examine tasks other than question-response. Moreover, the number of options should be taken into account. Compared to a three-option format, a four-option format has been more widely used in major English listening tests, such as the Test of English for International Communication (TOEIC; except Part 2 of the listening section), the Test of English as a Foreign Language Institutional Testing Program (TOEFL ITP), and EIKEN. Therefore, it is worth examining the attractiveness of distractors in a four-option picture-description listening task.

The aim of this replication study is to verify the abovementioned findings in Iimura (2014). As in the original study, this study examined the functionality of distractors from the two following research viewpoints: (a) compare keys with distractors with regard to test-takers' confidence levels, and (b) compare the attractiveness of distractors in terms of test-takers' confidence levels and response frequency. This study closely followed the research procedures adopted in the original study. As will be elaborated in the following sections, the relevant factors, such as the participants, the questionnaire, and

the scoring, were identical or closely related to those in the original study, while the task (picture-description) and the number of options (four) were altered.

3.2 Method
3.2.1 Participants
In line with the original study, this replication study was conducted with Japanese university students (N = 68, mean age = 19.5, SD = 0.5, male = 39, female = 28) whose English proficiency levels ranged between A2.1 and B1.2 according to the CEFR-J*1 (Tono, 2013). All participants were native speakers of Japanese and had started learning English in junior high school at the age of 12. Each student had been studying English for at least six years. Prior to data collection, participants were informed that all identifying information would remain confidential, and their permission to participate was obtained.

3.2.2 Listening Test
While the original study adopted a three-option question-response task, this study used a four-option picture-description task consisting of 15 test items from a TOEIC preparation book (Educational Testing Service, 2011). In this task, test-takers heard four statements (i.e., options) about a picture and were asked to select the one statement that best described the picture. The four statements were only heard, not read, and each was played only once. On the original TOEIC CD attached to the preparation book, there was approximately a one-second pause between options and a three-second pause between items. We edited the original CD to have three-second pauses between options, for answering the questionnaire, and eight-second pauses between items, to allow participants ample time to answer the listening test and the questionnaire. The listening test was played on a CD player.

3.2.3 Questionnaire
The design of the questionnaire made it possible to generate a confidence score for a given option as well as an attractiveness score for the distractors. To elicit test-takers' perceptions of the keys and distractors, the current study adopted the original study's questionnaire, which served as both an answer sheet and a survey to determine confidence levels. Because the original questionnaire was designed for a three-option format, it was tailored to a four-option format. As seen in Figure 1, the response scale for each option was symmetrically divided at the middle point (0: I'm not sure) into two sides for correctness (correct or incorrect), with three confidence levels on each side (H: very confident; M: moderately confident; L: not very confident).

Figure 1. Sample from the questionnaire. (For each item, each of the four options A-D is judged on a seven-point line running from Incorrect: very confident (H), moderately confident (M), not very confident (L), through I'm not sure, to Correct: not very confident (L), moderately confident (M), very confident (H).)

Test-takers were asked to circle the part of the line corresponding to the confidence level they had in their judgment of each option's correctness. In addition, they were asked to circle one of the capital letters A, B, C, or D representing their chosen option. Figure 1 illustrates a case in which a participant judged A as incorrect with moderate confidence, B as incorrect with low confidence, C as correct with high confidence, and D as correct with low confidence. This test-taker therefore decided on C as his or her final answer. During the experiment, participants were instructed to mark the degree of confidence they had in Option A immediately after hearing that option. This step was repeated for Options B, C, and D. They were then asked to select and circle one of the four options

as their final answer for the item.

3.2.4 Scoring
As described above, the questionnaire was used to elicit the test-takers' perceptions of each option. As in the original study, the keys and the distractors were coded separately, because the questionnaire responses for keys were intended to elicit test-takers' confidence in the correct answer, whereas those for distractors were intended to determine each distractor's level of attractiveness, indicated by test-takers' confidence in the incorrect answers.

Confidence scoring
The questionnaire data were coded with values ranging from one to seven on the basis of the correctness of the response and the test-takers' confidence levels. Table 1 illustrates how test-takers' confidence levels were converted into scores. The possible scores ranged from one to seven; low and high scores indicated that participants felt very confident in an incorrect (e.g., one) or a correct (e.g., seven) answer, respectively.

Table 1
Scoring Table for Confidence Ratings on Keys and Attractiveness of Distractors

Confidence level        Correctness judgment    Confidence/attractiveness score
Very confident          Correct                 7
Moderately confident    Correct                 6
Not very confident      Correct                 5
I'm not sure                                    4
Not very confident      Incorrect               3
Moderately confident    Incorrect               2
Very confident          Incorrect               1
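To make the coding rule in Table 1 concrete, the following is a minimal sketch in Python (not part of the original study; the function and variable names are illustrative) of how a single questionnaire response could be converted into a 1-7 score.

```python
# Minimal sketch of the Table 1 coding rule: a test-taker's judgment of one
# option ("correct", "incorrect", or "not sure") plus a confidence level
# (H = very, M = moderately, L = not very confident) maps onto a 1-7 score.
# Names and data structures are illustrative, not taken from the study.

CONFIDENCE_STEP = {"H": 3, "M": 2, "L": 1}  # distance from the midpoint (4)

def option_score(judgment, confidence=None):
    """Return the 1-7 code for one option."""
    if judgment == "not sure":
        return 4                                   # midpoint: "I'm not sure"
    step = CONFIDENCE_STEP[confidence]
    return 4 + step if judgment == "correct" else 4 - step

# Example: the response illustrated in Figure 1
print(option_score("incorrect", "M"))  # Option A -> 2
print(option_score("incorrect", "L"))  # Option B -> 3
print(option_score("correct", "H"))    # Option C -> 7
print(option_score("correct", "L"))    # Option D -> 5
```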

Attractiveness scoring
Distractor attractiveness was coded on the basis of correctness and confidence using the same scale, with values ranging from one to seven. Table 1 illustrates the conversion procedure for generating an attractiveness score from correctness and confidence. The possible scores ranged from one to seven, with low and high scores indicating that a given distractor was deemed by participants as either incorrect or correct with a high level of confidence (although all distractors were in fact incorrect). Thus, a given distractor ranged from not attractive (e.g., one) to attractive (e.g., seven). An attractive distractor led test-takers to deem it correct with high confidence; therefore, a code of seven was assigned to responses in which a participant deemed the distractor correct with high confidence. An unattractive distractor, on the other hand, only led test-takers to believe it to be incorrect; therefore, a distractor judged as incorrect with high confidence was assigned a code of one.

4. Results and Discussion
4.1 Listening Test Performance
The average score on the listening test in this replication study corresponded to 67% correct (out of 15 items), while the mean in the original study was 9.15 (out of 15, or 61%). This difference can be interpreted in two ways: (a) the participants in the replication study were more proficient than those in the original study, or (b) the picture-description task in the replication study was easier than the question-response task in the original study. With regard to internal consistency reliability (Cronbach's alpha), the replication study (α = .61) was lower than the original study (α = .70), probably because reliability is affected by the amount of variance among test-takers (Green, 2013); in other words, the participants' proficiency range in the replication study (SD = 2.30) was smaller than that in the original study (SD = 2.96).
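For reference, the internal consistency estimates reported here and below can be computed directly from a respondents-by-items score matrix. The following is a generic sketch of Cronbach's alpha using numpy; the demonstration data are invented and the function name is ours, not the study's.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents-by-items score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Invented 0/1 listening-test data: 6 test-takers x 4 items (illustration only)
demo = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
])
print(round(cronbach_alpha(demo), 2))
```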

Following the original study, as Table 2 shows, the participants were divided into three groups on the basis of the number of correct responses on the listening test: a low-score group (LG), a middle-score group (MG), and a high-score group (HG).

Table 2
Three Groups on the Basis of the Listening Test

Group (n)    95% CI
LG (14)      [6.10, 7.33]
MG (34)      [9.66, 10.23]
HG (20)      [12.44, 13.16]

Note. LG = low-score group; MG = middle-score group; HG = high-score group. Maximum possible score is 15. CI = confidence interval. Reliability (Cronbach's alpha) for the whole sample = .61.

4.2 Confidence and Attractiveness
4.2.1 Reliability of the Questionnaire
Internal consistency reliability (Cronbach's alpha) of the questionnaire was calculated separately for (a) confidence (i.e., keys) and (b) attractiveness (i.e., distractors). As Table 3 shows, the reliability for confidence (α = .67) was lower than that for attractiveness (α = .91), probably because the number of keys (15, one for each item) was smaller than the number of distractors (45, three for each item). This difference was also found in the original study (confidence = .80 and attractiveness = .92, respectively).

Table 3
Reliability of the Questionnaire Used in This Study

                    Confidence    Attractiveness
No. of items        15            45
Cronbach's alpha    .67           .91

4.2.2 Confidence in keys
Table 4 summarizes the questionnaire results. As can be seen, the mean confidence score for the keys increased as the English level (i.e., the three score groups) increased,

whereas the mean attractiveness score for the distractors decreased as the English level increased. A one-way ANOVA performed on the confidence scores indicated that the groups differed significantly, with a large effect size: F(2, 65) = 31.91, p < .001, η² = .50. Tukey's post-hoc tests revealed that all groups differed significantly from each other in confidence scores, p < .05. These results parallel the findings in the original study, where the three groups differed significantly with a large effect size, F(2, 72) = 53.22, p < .001, η² = .36, and the post-hoc test showed that they differed from each other at the .05 level.

Table 4
Average Test-Takers' Confidence in Keys and Attractiveness of Distractors

Score group       Low              Middle           High
                  M       SD       M       SD       M       SD
Confidence        4.55a   0.39
Attractiveness    3.19a   0.58

Note. Means in a row sharing subscripts (a and b) are significantly different from each other. The possible maximum score is seven.

These results suggest that advanced test-takers tended to have more confidence than less-advanced test-takers when they responded with correct answers. This may be related to the effective use of metacognitive strategies; that is, proficient listeners can monitor and evaluate their listening comprehension more efficiently than less-proficient listeners (e.g., Macaro, Graham, & Vanderplank, 2007; Vandergrift & Goh, 2012). It can therefore be concluded that advanced test-takers choose the correct answers with more confidence.
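As an illustration of the kind of group comparison just reported, the sketch below runs a one-way ANOVA with eta-squared and Tukey's HSD on simulated per-participant confidence scores using scipy and statsmodels; apart from the group sizes, the data are invented, and this is our reconstruction rather than the study's code.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Invented per-participant mean confidence scores for three proficiency groups
# (LG, MG, HG); the real data are not reproduced here.
rng = np.random.default_rng(0)
lg = rng.normal(4.5, 0.4, 14)
mg = rng.normal(5.0, 0.4, 34)
hg = rng.normal(5.5, 0.4, 20)

f, p = stats.f_oneway(lg, mg, hg)                # one-way ANOVA across groups

# Eta-squared: between-group sum of squares over total sum of squares
all_scores = np.concatenate([lg, mg, hg])
grand_mean = all_scores.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in (lg, mg, hg))
ss_total = ((all_scores - grand_mean) ** 2).sum()
eta_squared = ss_between / ss_total

# Tukey's HSD post-hoc comparisons between the three groups
groups = ["LG"] * len(lg) + ["MG"] * len(mg) + ["HG"] * len(hg)
print(f"F = {f:.2f}, p = {p:.4f}, eta^2 = {eta_squared:.2f}")
print(pairwise_tukeyhsd(all_scores, groups))
```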

4.2.3 Attractiveness of distractors
Another one-way ANOVA, performed on the attractiveness scores, again indicated that the groups differed significantly, with a large effect size: F(2, 65) = 12.01, p < .001, η² = .27. Tukey's post-hoc tests revealed that the scores of the LG and MG did not differ from each other, but that both were significantly higher than the scores of the HG, p < .05. These results are consistent with the original study, where the three groups differed significantly with a large effect size, F(2, 72) = 29.85, p < .001, η² = .21, and the post-hoc test demonstrated that the attractiveness of distractors for the LG and MG was significantly higher than that for the HG at the .05 level. These results indicate that lower-level test-takers tended to be more tempted by distractors than advanced test-takers. Thus, it can be concluded that less-advanced test-takers tend to be tempted by distractors.

4.2.4 Response frequency and discrimination in distractors
Table 5 reports the response frequency (i.e., how many test-takers chose each option) and the results of the chi-square tests for each of the 15 items. In Item 1, for example, 45.6% of the participants (n = 31) chose the key, 39.7% (n = 27) chose Distractor 1, 11.8% (n = 8) selected Distractor 2, and the rest (n = 2) chose Distractor 3. Fisher's exact test*2 was carried out to determine whether there was a significant difference in the number of responses between these four options, resulting in a sizable difference between them: χ²(3) = 35.41, p < .001. Overall, in all 15 items there was a significant difference in response frequencies between the four options.

Table 5 also shows the discriminatory power (point-biserial correlation coefficient, rpbi) for each item. According to Haladyna and Rodriguez (2013), distractors should be negatively correlated with the total score, whereas keys should be positively correlated with the total test score. As the table shows, the correlations for the keys were higher than those for the distractors (Distractors 1, 2, & 3) in all 15 items. Moreover, all the distractors produced negative or near-zero point-biserial correlations. Therefore, in line with the original study, all distractors in the replication study can be considered to have functioned properly in terms of discrimination.
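The two conventional indices discussed above can be computed with scipy. In the sketch below, the response counts are those reported for Item 1, while the option-choice and total-score vectors used for the point-biserial correlation are invented placeholders rather than data from the study.

```python
import numpy as np
from scipy import stats

# Response frequencies for Item 1 as reported above: key, D1, D2, D3
counts = np.array([31, 27, 8, 2])

# Chi-square goodness-of-fit test of equal response frequencies across the
# four options (the study reports Fisher's exact tests where expected counts
# were very low; see Note 2).
chi2, p = stats.chisquare(counts)
print(f"chi2(3) = {chi2:.2f}, p = {p:.4f}")

# Point-biserial discrimination for one distractor: correlation between
# choosing that option (1/0) and the total listening score.  The two
# vectors here are invented placeholders, not data from the study.
chose_distractor = np.array([0, 1, 0, 0, 1, 1, 0, 0, 1, 0])
total_score      = np.array([12, 7, 11, 13, 6, 8, 10, 14, 5, 12])
r_pbi, _ = stats.pointbiserialr(chose_distractor, total_score)
print(f"r_pbi = {r_pbi:.2f} (negative values indicate a functioning distractor)")
```

With the Item 1 counts, the goodness-of-fit statistic reproduces the χ²(3) = 35.41 reported above.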

Table 5
Response Frequency, Fisher's Exact Tests, and Discriminatory Power for Each Item

Item   Key/IF % (rpbi)   Distractor 1 % (rpbi)   Distractor 2 % (rpbi)   Distractor 3 % (rpbi)   χ²
1      45.6 (.39)        39.7 (.25)              11.8 (.08)              2.9 (.26)               35.41***
2           (.62)        29.4 (.27)              20.6 (.43)              4.4 (.05)               ***
3           (.54)        26.5 (.16)              22.1 (.43)              1.5 (.16)               ***
4           (.33)         8.8 (.23)               4.4 (.23)                                      ***
5           (.34)         4.4 (.34)                                                              ***
6           (.57)        23.5 (.41)              20.6 (.26)              2.9 (.02)               ***
7           (.32)        17.6 (.48)              16.2 (.04)              13.2 (.02)              ***
8           (.30)        54.4 (.22)              20.6 (.05)                                      **
9           (.38)         3.0 (.20)               3.0 (.33)                                      ***
10          (.23)         1.5 (.06)               1.5 (.26)                                      ***
11          (.19)         5.9 (.10)               1.5 (.21)                                      ***
12          (.30)         2.9 (.12)               2.9 (.26)              2.9 (.12)               ***
13          (.44)        13.2 (.41)               1.5 (.16)                                      ***
14          (.38)        23.5 (.33)               1.5 (.21)                                      ***
15          (.37)        38.8 (.31)              25.4 (.04)              11.9 (.07)              *

Note. IF = item facility. rpbi = point-biserial correlation. * p < .05, ** p < .01, *** p < .001.

4.2.5 The lowest attractiveness and response frequency
Table 6 presents the results of the 3 x 3 (Group x Distractor) two-way mixed ANOVAs that were carried out on the attractiveness scores for each of the 15 MC questions. The levels of the Distractor variable were based on response frequency, such that Dis 1 was the most frequently chosen distractor, Dis 2 the next, and Dis 3 the least frequently chosen. The alpha level was set to .003 (.05/15) because 15 significance tests were conducted, one for each item. The post-hoc analysis was performed with one-way ANOVAs followed by t-tests with Bonferroni correction.
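One way to carry out such a 3 x 3 mixed ANOVA is sketched below with the pingouin package; the study does not report its analysis software, so this toolchain is only an assumption, and the long-format data frame is invented.

```python
import numpy as np
import pandas as pd
import pingouin as pg

# Invented long-format data: one row per participant x distractor, with the
# participant's score group (between-subjects factor) and the distractor's
# frequency rank (within-subjects factor) plus an attractiveness score.
rng = np.random.default_rng(1)
rows = []
for pid in range(68):
    group = "LG" if pid < 14 else "MG" if pid < 48 else "HG"
    for dis in ("Dis1", "Dis2", "Dis3"):
        rows.append({"id": pid, "group": group, "distractor": dis,
                     "attractiveness": rng.integers(1, 8)})
df = pd.DataFrame(rows)

# 3 (Group) x 3 (Distractor) mixed ANOVA for a single item
aov = pg.mixed_anova(data=df, dv="attractiveness", within="distractor",
                     subject="id", between="group")
print(aov[["Source", "F", "p-unc", "np2"]])

# Bonferroni-adjusted alpha, as in the study: .05 / 15 items = .003
alpha = 0.05 / 15
print(aov["p-unc"] < alpha)
```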

The results of the ANOVAs can be divided into five categories in terms of statistically significant differences in the two factors (Group x Distractor): (a) interaction (Item 5), (b) Group (Items 2 & 11), (c) Distractor (Items 1, 3, 4, 6, 8, & 15), (d) Group and Distractor (Items 7, 12, 13, & 14), and (e) no difference (Items 9 & 10). Overall, in two-thirds of the items the three distractors differed significantly from each other in attractiveness, and in one-third of the items the level of distractor attractiveness varied across groups. This categorization based on the ANOVA results, however, may provide only a partial understanding of distractor attractiveness. It is prudent to also take into account the response frequencies shown in Table 5, cross-checking the rank of attractiveness against the rank of response frequency for the distractors of each item.

Table 6
Two-Way ANOVAs on Attractiveness of Three Distractors for 15 Items

Item      Source               p          η²      Post hoc
Item 1    Group                ns
          Distractor           *          0.27    Dis 3 < 2 < 1
          Group x Distractor   ns
Item 2    Group                < .001*    0.06    H < M = L
          Distractor           ns
          Group x Distractor   ns
Item 3    Group                ns
          Distractor           < .001*    0.34    Dis 3 < 1 = 2
          Group x Distractor   ns
Item 4    Group                ns
          Distractor           < .001*    0.11    Dis 3 < 2 = 1
          Group x Distractor   ns
(Continues)

Table 6 (Continued)
Two-Way ANOVAs on Attractiveness of Three Distractors for 15 Items

Item      Source               p          η²      Post hoc
Item 5    Group                ns
          Distractor           < .001*
          Group x Distractor   < .001*    0.06    L: Dis 2 = 3 < 1; M: Dis 2 < 1 = 3; H: Dis 2 = 1 = 3
Item 6    Group                ns
          Distractor           < .001*    0.13    Dis 3 < 2 = 1
          Group x Distractor   ns
Item 7    Group                < .001*    0.10    H < M = L
          Distractor           *          0.05    Dis 1 = 3 < 2
          Group x Distractor   ns
Item 8    Group                ns
          Distractor           < .001*    0.48    Dis 3 < 2 < 1
          Group x Distractor   ns
Item 9    Group                ns
          Distractor           ns
          Group x Distractor   ns
Item 10   Group                ns
          Distractor           ns
          Group x Distractor   ns
Item 11   Group                *          0.08    H < M = L
          Distractor           ns
          Group x Distractor   ns
(Continues)

Table 6 (Continued)
Two-Way ANOVAs on Attractiveness of Three Distractors for 15 Items

Item      Source               p          η²      Post hoc
Item 12   Group                < .001*    0.12    H < M = L
          Distractor           < .001*    0.05    Dis 3 = 2 < 1
          Group x Distractor   ns
Item 13   Group                < .001*    0.14    H < M = L
          Distractor           < .001*    0.28    Dis 2 = 3 < 1
          Group x Distractor   ns
Item 14   Group                < .001*    0.08    H < M = L
          Distractor           < .001*    0.32    Dis 3 = 2 < 1
          Group x Distractor   ns
Item 15   Group                ns
          Distractor           < .001*    0.08    Dis 3 = 2 < 1
          Group x Distractor   ns

Note. Dis 1 = the most frequently chosen distractor; Dis 2 = the second-most frequently chosen distractor; Dis 3 = the least frequently chosen distractor. H = high-scoring group; M = middle-scoring group; L = low-scoring group. * p < .003.

Table 7 presents four categories that combine the results of the ANOVAs with the response frequency results. It should be noted that in Table 7 the items were categorized by these results regardless of whether the main effect of Group was statistically significant; whether the Group effect was significant is, however, also indicated. Although this information may not be crucial for identifying the relationship between the least attractive and the least frequently chosen distractors, it reveals a tendency in the test-takers' perceptions of the overall attractiveness of the distractors.

Table 7
Items Categorized on the Basis of the Relationship between Attractiveness and Response Frequency

Category: The least frequently chosen distractor is...                 Main effect of Group: ns    Main effect of Group: sig
1. the least attractive                                                1, 3, 4, 6, 8
2. as attractive as the second most frequently chosen distractor       15                          2, 12, 13, 14
3. as attractive as the most frequently chosen distractor              9, 10                       7, 11
Others                                                                 (Item 5)

The first category, "The least frequently chosen distractor is the least attractive," was assigned when the main effect of Distractor was significant and the least attractive distractor was also the least frequently selected one. This category reflects the conventional assumption that a distractor's attractiveness is commensurate with its response frequency. Only five of the 15 items fell into this category (Items 1, 3, 4, 6, & 8).

The second category, "The least frequently chosen distractor is as attractive as the second most frequently chosen distractor," was assigned when (a) there was no significant main effect of Distractor among the three distractors, or (b) the main effect of Distractor was significant but the second most frequently chosen distractor was no more attractive than the least frequently chosen one. Five of the 15 items fell into this category (Items 2, 12, 13, 14, & 15).

The third category, "The least frequently chosen distractor is as attractive as the most frequently chosen distractor," was assigned when (a) no significant main effect of Distractor was found, or (b) the main effect of Distractor was significant but the most frequently chosen distractor was no more attractive than the least frequently chosen one. Four of the 15 items fell into this category (Items 7, 9, 10, & 11).

The last category (Others) was derived from the results in which there was a significant interaction between Group and Distractor (Item 5). More specifically, for the LG, the least frequently chosen distractors (Dis 2 & 3) were the least attractive; for the MG and HG, on the other hand, the least frequently chosen distractors were not the least attractive.

In summary, these results showed that in one-third of the items the least frequently selected distractor was the least attractive, but that in more than half of the items the least frequently chosen distractor was as attractive as the second most, or even the most, frequently chosen distractor. One might argue that some items in this study were too easy for their distractors to attract test-takers. In fact, the item facility of several items (Items 5, 9, 10, 11, & 12) exceeded 90%, and the mean attractiveness score of the distractors in those items was relatively low. Not all of the easy items, however, failed to attract test-takers. In Item 11, for instance, the mean distractor attractiveness was not low (M = 2.3 for the HG, M = 3.0 for the MG, and M = 3.4 for the LG, respectively). Hence, it can be said that distractors, even in easy items, did attract test-takers to some extent. This finding accords well with the original study, in which there were only five out of 15 items where the least frequently chosen distractor was also the least attractive, whereas in seven out of 15 items the least frequently chosen distractor was not the least attractive.

These findings lead us to reconsider the conventional way of evaluating distractors by response frequency. As found in this replication study as well as in the original study, there were several infrequently selected distractors that were still able to function properly. It is necessary to differentiate distractors that were neither chosen nor attractive from those that were not chosen but were plausible enough to attract test-takers. Therefore, it is possible that response frequency alone is insufficient for evaluating the effectiveness of distractors. Instead, tools that can elicit test-takers' perceptions of each distractor should be incorporated. The questionnaire developed by Iimura (2014) can be recommended for this purpose because it has shown strong internal consistency in both the original study and this replication. Having said that, one might ask how such tools could be incorporated into actual test design. In high-stakes test development, new items are usually checked through pilot testing or field testing. Examining the attractiveness of each distractor should be done in such trialing phases using the abovementioned questionnaire.
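In such a trialing phase, the cross-check recommended here could be automated along the following lines; this is a sketch with invented per-item response counts and mean attractiveness scores, not code or data from the study.

```python
# Sketch of the cross-check recommended above: for each piloted item, compare
# which distractor is chosen least often with which has the lowest mean
# attractiveness score from the questionnaire.  All data below are invented.

items = {
    # item: ({distractor: times chosen}, {distractor: mean attractiveness})
    1: ({"A": 27, "B": 8, "C": 2}, {"A": 4.1, "B": 3.0, "C": 2.2}),
    2: ({"A": 20, "B": 14, "C": 3}, {"A": 3.8, "B": 2.9, "C": 3.1}),
}

for item, (freq, attract) in items.items():
    least_chosen = min(freq, key=freq.get)
    least_attractive = min(attract, key=attract.get)
    if least_chosen != least_attractive:
        # The rarely chosen distractor still looks plausible to test-takers,
        # so deleting it on frequency grounds alone may be premature.
        print(f"Item {item}: review distractor {least_chosen} "
              f"(rarely chosen but not the least attractive)")
    else:
        print(f"Item {item}: frequency and attractiveness agree")
```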

5. Conclusion
The functionality of distractors has predominantly been evaluated by response frequency; that is, prior investigations have assumed a direct relationship between response frequency and the attractiveness of distractors. This study attempted to reexamine the relationship between the attractiveness and the frequency of distractors by using a questionnaire that revealed test-takers' confidence in each option. The findings of this study strongly support Iimura (2014) in that (a) upper-level test-takers had greater confidence when responding with the correct answers, (b) lower-level test-takers were more attracted to distractors, and (c) the least frequently chosen distractors were not necessarily the least attractive. Therefore, it can be surmised that conventional item analyses are insufficient for evaluating the effectiveness of distractors during test development. In other words, a new kind of survey, in which test-takers can evaluate each distractor independently, should be incorporated into MC test development.

This study examined a four-option picture-description listening task to investigate distractor attractiveness, which a previous study had examined with a three-option question-response listening task. That said, both question-response and picture-description tasks are relatively simple in comparison to passage comprehension tasks, which contain more complex information within the passage itself, the question stems, and the options. Future research should include passage comprehension tasks with longer dialogues or monologues.

Acknowledgements
This work was supported by JSPS (Japan Society for the Promotion of Science) KAKENHI (Grant-in-Aid for Scientific Research) Grant Number 15K02790.

Notes
1. English listening proficiency levels from A2.1 to B1.2 in the CEFR-J are equivalent to TOEIC listening scores from 110 (185) to 335 (395) (Tono, 2013, p. 229).
2. Fisher's exact tests were used instead of chi-square tests because they are suitable when the expected frequencies are too low (e.g., when the sample size is quite small; Field, 2009).

References
Bachman, L., & Palmer, A. (2010). Language assessment in practice. Oxford University Press.
Douglas, D. (2010). Understanding language testing. London: Hodder Education.
Downing, S. M. (2006). Selected-response item formats in test development. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development. Mahwah, NJ: Lawrence Erlbaum Associates.
Educational Testing Service. (2011). TOEIC test official practice: Listening. Tokyo: Institute for International Business Communication.
Field, A. (2009). Discovering statistics using SPSS (3rd ed.). London: Sage.
Field, J. (2012). Cognitive validity. In A. Geranpayeh & L. Taylor (Eds.), Examining listening: Research and practice in assessing second language listening. Cambridge University Press.
Fulcher, G. (2010). Practical language testing. London: Hodder Education.
Green, R. (2013). Statistical analyses for language testers. Hampshire: Palgrave Macmillan.
Haladyna, T. M. (2004). Developing and validating multiple-choice test items (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Haladyna, T. M., & Downing, S. M. (1993). How many options is enough for a multiple-choice test item? Educational and Psychological Measurement, 53.
Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. New York:

Routledge.
Henning, G. (1987). A guide to language testing: Development, evaluation, research. Cambridge, MA: Newbury House Publisher.
Hughes, A. (2003). Testing for language teachers (2nd ed.). Cambridge University Press.
Iimura, H. (2014). Attractiveness of distractors in multiple-choice listening tests. JLTA Journal, 17.
Lee, H., & Winke, P. (2013). The differences among three-, four-, and five-option-item formats in the context of a high-stakes English-language listening test. Language Testing, 30.
Macaro, E., Graham, S., & Vanderplank, R. (2007). A review of listening strategies: Focus on sources of knowledge and on success. In A. D. Cohen & E. Macaro (Eds.), Language learner strategies. Oxford University Press.
Pashler, H., & Johnson, J. C. (1998). Attentional limitations in dual-task performance. In H. Pashler (Ed.), Attention. Hove, East Sussex: Psychology Press.
Rodriguez, M. C. (2005). Three options are optimal for multiple-choice items: A meta-analysis of 80 years of research. Educational Measurement: Issues and Practice, 24(2).
Shizuka, T., Takeuchi, O., Yashima, T., & Yoshizawa, K. (2006). A comparison of three- and four-option English tests for university entrance selection purposes in Japan. Language Testing, 23.
Tono, Y. (Ed.). (2013). The CEFR-J handbook: A resource book for using CAN-DO descriptors for English language teaching. Tokyo: Taishukan.
Vandergrift, L., & Goh, C. C. M. (2012). Teaching and learning second language listening: Metacognition in action. New York, NY: Routledge.


More information

Report on FY2014 Annual User Satisfaction Survey on Patent Examination Quality

Report on FY2014 Annual User Satisfaction Survey on Patent Examination Quality Report on FY2014 Annual User Satisfaction Survey on Patent Examination Quality May 2015 Japan Patent Office ABSTRACT Ⅰ. Introduction High quality and globally reliable patents granted by the JPO (Japan

More information

This self-archived version is provided for scholarly purposes only. The correct reference for this article is as follows:

This self-archived version is provided for scholarly purposes only. The correct reference for this article is as follows: SOCIAL AFFILIATION CUES PRIME HELP-SEEKING INTENTIONS 1 This self-archived version is provided for scholarly purposes only. The correct reference for this article is as follows: Rubin, M. (2011). Social

More information

The Stability of Undergraduate Students Cognitive Test Anxiety Levels

The Stability of Undergraduate Students Cognitive Test Anxiety Levels A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to Practical Assessment, Research & Evaluation. Permission is granted to distribute

More information

Title: Healthy snacks at the checkout counter: A lab and field study on the impact of shelf arrangement and assortment structure on consumer choices

Title: Healthy snacks at the checkout counter: A lab and field study on the impact of shelf arrangement and assortment structure on consumer choices Author's response to reviews Title: Healthy snacks at the checkout counter: A lab and field study on the impact of shelf arrangement and assortment structure on consumer choices Authors: Ellen van Kleef

More information

CHAMP: CHecklist for the Appraisal of Moderators and Predictors

CHAMP: CHecklist for the Appraisal of Moderators and Predictors CHAMP - Page 1 of 13 CHAMP: CHecklist for the Appraisal of Moderators and Predictors About the checklist In this document, a CHecklist for the Appraisal of Moderators and Predictors (CHAMP) is presented.

More information

A Cross-validation of easycbm Mathematics Cut Scores in. Oregon: Technical Report # Daniel Anderson. Julie Alonzo.

A Cross-validation of easycbm Mathematics Cut Scores in. Oregon: Technical Report # Daniel Anderson. Julie Alonzo. Technical Report # 1104 A Cross-validation of easycbm Mathematics Cut Scores in Oregon: 2009-2010 Daniel Anderson Julie Alonzo Gerald Tindal University of Oregon Published by Behavioral Research and Teaching

More information

Internal Consistency and Reliability of the Networked Minds Measure of Social Presence

Internal Consistency and Reliability of the Networked Minds Measure of Social Presence Internal Consistency and Reliability of the Networked Minds Measure of Social Presence Chad Harms Iowa State University Frank Biocca Michigan State University Abstract This study sought to develop and

More information

Running head: THE DEVELOPMENT AND PILOTING OF AN ONLINE IQ TEST. The Development and Piloting of an Online IQ Test. Examination number:

Running head: THE DEVELOPMENT AND PILOTING OF AN ONLINE IQ TEST. The Development and Piloting of an Online IQ Test. Examination number: 1 Running head: THE DEVELOPMENT AND PILOTING OF AN ONLINE IQ TEST The Development and Piloting of an Online IQ Test Examination number: Sidney Sussex College, University of Cambridge The development and

More information

Statistics. Nur Hidayanto PSP English Education Dept. SStatistics/Nur Hidayanto PSP/PBI

Statistics. Nur Hidayanto PSP English Education Dept. SStatistics/Nur Hidayanto PSP/PBI Statistics Nur Hidayanto PSP English Education Dept. RESEARCH STATISTICS WHAT S THE RELATIONSHIP? RESEARCH RESEARCH positivistic Prepositivistic Postpositivistic Data Initial Observation (research Question)

More information

Gang Zhou, Xiaochun Niu. Dalian University of Technology, Liao Ning, China

Gang Zhou, Xiaochun Niu. Dalian University of Technology, Liao Ning, China Psychology Research, June 2015, Vol. 5, No. 6, 372-379 doi:10.17265/2159-5542/2015.06.003 D DAVID PUBLISHING An Investigation into the Prevalence of Voice Strain in Chinese University Teachers Gang Zhou,

More information

RESULTS. Chapter INTRODUCTION

RESULTS. Chapter INTRODUCTION 8.1 Chapter 8 RESULTS 8.1 INTRODUCTION The previous chapter provided a theoretical discussion of the research and statistical methodology. This chapter focuses on the interpretation and discussion of the

More information

Students and parents/guardians are highly encouraged to use Parent Connect to track their progress.

Students and parents/guardians are highly encouraged to use Parent Connect to track their progress. 1 Assessment and Grading Plan Grades earned at Arundel High School will be a reflection of student s mastery of the related and relevant national, state, and industry standards pertaining to the course

More information

Importance of Good Measurement

Importance of Good Measurement Importance of Good Measurement Technical Adequacy of Assessments: Validity and Reliability Dr. K. A. Korb University of Jos The conclusions in a study are only as good as the data that is collected. The

More information

Providing Evidence for the Generalizability of a Speaking Placement Test Scores

Providing Evidence for the Generalizability of a Speaking Placement Test Scores Providing Evidence for the Generalizability of a Speaking Placement Test Scores Payman Vafaee 1, Behrooz Yaghmaeyan 2 Received: 15 April 2015 Accepted: 10 August 2015 Abstract Three major potential sources

More information

Formulating and Evaluating Interaction Effects

Formulating and Evaluating Interaction Effects Formulating and Evaluating Interaction Effects Floryt van Wesel, Irene Klugkist, Herbert Hoijtink Authors note Floryt van Wesel, Irene Klugkist and Herbert Hoijtink, Department of Methodology and Statistics,

More information

Instrumental activity in achievement motivation1. Department of Child Study, Faculty of Home Economics, Japan Women's University, Bunkyo-ku, Tokyo 112

Instrumental activity in achievement motivation1. Department of Child Study, Faculty of Home Economics, Japan Women's University, Bunkyo-ku, Tokyo 112 Japanese Psychological Research 1981, Vol.23, No.2, 79-87 Instrumental activity in achievement motivation1 MISAKO MIYAMOTO2 Department of Child Study, Faculty of Home Economics, Japan Women's University,

More information

Biserial Weights: A New Approach

Biserial Weights: A New Approach Biserial Weights: A New Approach to Test Item Option Weighting John G. Claudy American Institutes for Research Option weighting is an alternative to increasing test length as a means of improving the reliability

More information

BASIC PRINCIPLES OF ASSESSMENT

BASIC PRINCIPLES OF ASSESSMENT TOPIC 4 BASIC PRINCIPLES OF ASSESSMENT 4.0 SYNOPSIS Topic 4 defines the basic principles of assessment (reliability, validity, practicality, washback, and authenticity) and the essential sub-categories

More information

On the purpose of testing:

On the purpose of testing: Why Evaluation & Assessment is Important Feedback to students Feedback to teachers Information to parents Information for selection and certification Information for accountability Incentives to increase

More information

THE EFFECTIVENESS OF VARIOUS TRAINING PROGRAMMES 1. The Effectiveness of Various Training Programmes on Lie Detection Ability and the

THE EFFECTIVENESS OF VARIOUS TRAINING PROGRAMMES 1. The Effectiveness of Various Training Programmes on Lie Detection Ability and the THE EFFECTIVENESS OF VARIOUS TRAINING PROGRAMMES 1 The Effectiveness of Various Training Programmes on Lie Detection Ability and the Role of Sex in the Process THE EFFECTIVENESS OF VARIOUS TRAINING PROGRAMMES

More information

Introduction to Meta-Analysis

Introduction to Meta-Analysis Introduction to Meta-Analysis Nazım Ço galtay and Engin Karada g Abstract As a means to synthesize the results of multiple studies, the chronological development of the meta-analysis method was in parallel

More information

INSPECT Overview and FAQs

INSPECT Overview and FAQs WWW.KEYDATASYS.COM ContactUs@KeyDataSys.com 951.245.0828 Table of Contents INSPECT Overview...3 What Comes with INSPECT?....4 Reliability and Validity of the INSPECT Item Bank. 5 The INSPECT Item Process......6

More information

Prosody Rule for Time Structure of Finger Braille

Prosody Rule for Time Structure of Finger Braille Prosody Rule for Time Structure of Finger Braille Manabi Miyagi 1-33 Yayoi-cho, Inage-ku, +81-43-251-1111 (ext. 3307) miyagi@graduate.chiba-u.jp Yasuo Horiuchi 1-33 Yayoi-cho, Inage-ku +81-43-290-3300

More information

Running Head: ADVERSE IMPACT. Significance Tests and Confidence Intervals for the Adverse Impact Ratio. Scott B. Morris

Running Head: ADVERSE IMPACT. Significance Tests and Confidence Intervals for the Adverse Impact Ratio. Scott B. Morris Running Head: ADVERSE IMPACT Significance Tests and Confidence Intervals for the Adverse Impact Ratio Scott B. Morris Illinois Institute of Technology Russell Lobsenz Federal Bureau of Investigation Adverse

More information

Principles of Sociology

Principles of Sociology Principles of Sociology DEPARTMENT OF ECONOMICS ATHENS UNIVERSITY OF ECONOMICS AND BUSINESS [Academic year 2017/18, FALL SEMESTER] Lecturer: Dimitris Lallas Principles of Sociology 4th Session Sociological

More information

Original Article. Relationship between sport participation behavior and the two types of sport commitment of Japanese student athletes

Original Article. Relationship between sport participation behavior and the two types of sport commitment of Japanese student athletes Journal of Physical Education and Sport (JPES), 17(4), Art 267, pp. 2412-2416, 2017 online ISSN: 2247-806X; p-issn: 2247 8051; ISSN - L = 2247-8051 JPES Original Article Relationship between sport participation

More information

Record of the Consultation on Pharmacogenomics/Biomarkers

Record of the Consultation on Pharmacogenomics/Biomarkers This English version of the record of the consultation has been published by PMDA. In the event of inconsistency between the Japanese original and this English translation, the former shall prevail. (Attachment

More information

PTHP 7101 Research 1 Chapter Assignments

PTHP 7101 Research 1 Chapter Assignments PTHP 7101 Research 1 Chapter Assignments INSTRUCTIONS: Go over the questions/pointers pertaining to the chapters and turn in a hard copy of your answers at the beginning of class (on the day that it is

More information

Observational Category Learning as a Path to More Robust Generative Knowledge

Observational Category Learning as a Path to More Robust Generative Knowledge Observational Category Learning as a Path to More Robust Generative Knowledge Kimery R. Levering (kleveri1@binghamton.edu) Kenneth J. Kurtz (kkurtz@binghamton.edu) Department of Psychology, Binghamton

More information

Project exam in Cognitive Psychology PSY1002. Autumn Course responsible: Kjellrun Englund

Project exam in Cognitive Psychology PSY1002. Autumn Course responsible: Kjellrun Englund Project exam in Cognitive Psychology PSY1002 Autumn 2007 674107 Course responsible: Kjellrun Englund Stroop Effect Dual processing causing selective attention. 674107 November 26, 2007 Abstract This document

More information

Chapter Three. Methodology. This research used experimental design with quasi-experimental

Chapter Three. Methodology. This research used experimental design with quasi-experimental 37 Chapter Three Methodology This chapter presents research design, research setting, population and sampling, data collection method and data collection procedure. Data analysis is also presented in this

More information

4 Diagnostic Tests and Measures of Agreement

4 Diagnostic Tests and Measures of Agreement 4 Diagnostic Tests and Measures of Agreement Diagnostic tests may be used for diagnosis of disease or for screening purposes. Some tests are more effective than others, so we need to be able to measure

More information

Reliability and Validity of the Divided

Reliability and Validity of the Divided Aging, Neuropsychology, and Cognition, 12:89 98 Copyright 2005 Taylor & Francis, Inc. ISSN: 1382-5585/05 DOI: 10.1080/13825580590925143 Reliability and Validity of the Divided Aging, 121Taylor NANC 52900

More information