
Are the Least Frequently Chosen Distractors the Least Attractive? The Case of a Four-Option Picture-Description Listening Test

Hideki IIMURA
Prefectural University of Kumamoto

Abstract

This study reports on a replication of Iimura's (2014) study of the attractiveness of distractors in multiple-choice (MC) listening tests. Sixty-eight Japanese university students took a picture-description MC listening test (15 questions, four options each). During the test, the participants were asked to judge each of the four options as correct or incorrect and to report the degree of confidence they had in each judgment. On the basis of the confidence level and the correctness of the response, confidence and attractiveness scores were generated. To assess how listening ability affected test-takers' confidence and the distractors' attractiveness, three groups were formed on the basis of the scores obtained on the listening test. The results of the replication confirmed the original study, suggesting that (a) the least frequently chosen distractors were not always the least attractive, (b) upper-level listeners were less attracted to distractors, and (c) upper-level listeners had greater confidence when responding with the correct answers. This article concludes that conventional item analyses (i.e., response frequency and discriminatory power) are insufficient for evaluating the effectiveness of distractors, and that a new kind of survey, in which test-takers evaluate each distractor independently, should be incorporated into future MC test development.

1. Introduction

A typical multiple-choice (MC) item consists of three or four options. One, often known as the key, is correct, while the other options, the distractors, are incorrect answers. Downing (2006) explained that the MC format is favored in large-scale achievement tests because it has significant validity advantages.

In addition to its efficiency for automated scoring, the MC format, if properly constructed, can assess a wide range of content and measure high-level cognitive skills. From a listening assessment perspective, the MC format can tap various levels of processing of the input (e.g., the phonological, word, sentence, pragmatic, or discourse levels; Field, 2012).

However, the MC format has often been criticized for its limitations, such as inducing random guessing or creating a negative washback effect (Hughes, 2003). From a listening assessment perspective, it imposes heavy cognitive demands because of dual-task interference between audition (listening to the text) and vision (reading questions and options) (Pashler & Johnston, 1998). Furthermore, valid MC items, especially effective distractors, are very difficult to construct, and creating a sufficient number of plausible distractors is extremely difficult (Douglas, 2010). In fact, in his review of MC-related articles, Rodriguez (2005) found that only one or two distractors among four or five options functioned as intended. Evidently, then, one vital factor in successful MC tests is creating functional distractors.

As many theorists have contended, the functionality of a distractor has conventionally been evaluated in two ways: (a) discriminability and (b) response frequency (Fulcher, 2010; Haladyna, 2004; Haladyna & Downing, 1993; Henning, 1987). Discriminability measures the extent to which a distractor can distinguish high-ability from low-ability test-takers. Response frequency measures the number of test-takers who choose each distractor. In other words, dysfunctional distractors have no discriminatory power and attract few test-takers. In test development, distractors that succeed in attracting a certain number of test-takers are judged as functional and therefore remain in the test, whereas distractors that attract few test-takers are judged as unattractive and become targets for modification.

Previous studies have indicated that decreasing the number of MC options has little effect on MC test results. For example, decreasing options from four to three, by eliminating the least frequently chosen response, did not affect item difficulty or item discrimination (Shizuka, Takeuchi, Yashima, & Yoshizawa, 2006).

Similar results were found when comparing five-, four-, and three-option listening tests (Lee & Winke, 2013): the three-option test differed in item difficulty from the four- and five-option tests, but the three formats did not differ in item discrimination. Although these studies employed different option-deletion methods, namely item analysis (Shizuka et al., 2006) and evaluators' judgments (Lee & Winke, 2013), both depended on the same assumption: that the least frequently selected distractors cannot attract the attention of test-takers.

Considering the nature of the MC test-taking process, the relationship between response frequency and the attractiveness of distractors requires further consideration. It seems reasonable to assume that when taking an MC test, test-takers deliberate three or four times per item, depending on the number of options, before they finally choose one. That is, given that test-takers must select only one option, some unselected distractors may nevertheless have functioned as plausible ones. We can therefore propose that unselected distractors are not uniformly unattractive to test-takers, and that each distractor's attractiveness should be evaluated separately from its response frequency.

2. The Original Study

To evaluate distractor attractiveness, Iimura (2014) developed a questionnaire in which participants were asked to judge each option as correct or incorrect and to indicate the degree of confidence in their judgment. He administered the questionnaire to 75 Japanese university students to explore distractor attractiveness from the following perspectives:

1. Compare keys with distractors with regard to test-takers' confidence levels;
2. Compare the attractiveness of distractors in terms of test-takers' confidence levels and response frequency. (p. 21)

With regard to the first perspective, the original study found that differences in English proficiency affected test-takers' confidence in choosing correct and incorrect answers. There was a significant difference between proficiency levels in the degree of confidence in both keys and distractors, indicating that proficient listeners were more likely to choose the correct answer with more confidence and that less proficient listeners were more likely to be distracted by distractors.

With regard to the second perspective, the study revealed that response frequency was not always consistent with the attractiveness of distractors: in more than half of the items, the least frequently chosen distractor was not the least attractive. Based on these results, Iimura concluded that conventional item analysis based on response frequency is not sufficient for evaluating the effectiveness of distractors.

3. The Present Study

3.1 Aims

Although Iimura (2014) revealed the insufficiency of conventional distractor analysis, further investigation is needed to verify his findings because his study examined only a three-option question-response task. Given that task types can affect test performance (Bachman & Palmer, 2010), tasks other than question-response need to be examined. Moreover, the number of options should be taken into account. Compared to the three-option format, the four-option format is more widely used in major English listening tests, such as the Test of English for International Communication (TOEIC; except Part 2 of the listening section), the Test of English as a Foreign Language Institutional Testing Program (TOEFL ITP), and EIKEN. It is therefore worth examining the attractiveness of distractors in a four-option picture-description listening task.

The aim of this replication study is to verify the abovementioned findings of Iimura (2014). As in the original study, this study examined the functionality of distractors from the following two research viewpoints: (a) comparing keys with distractors with regard to test-takers' confidence levels, and (b) comparing the attractiveness of distractors in terms of test-takers' confidence levels and response frequency. This study closely followed the research procedures adopted in the original study. As elaborated in the following sections, the relevant factors, such as participants, the questionnaire, and scoring, were identical or closely related to those in the original study, while the task (picture-description) and the number of options (four) were altered.

3.2 Method

3.2.1 Participants

In line with the original study, this replication study was conducted with Japanese university students (N = 68, mean age = 19.5, SD = 0.5; male = 39, female = 28) whose English proficiency levels ranged between A2.1 and B1.2 on the CEFR-J*1 (Tono, 2013). All participants were native speakers of Japanese and had started learning English in junior high school at the age of 12, so each had been studying English for at least six years. Prior to data collection, participants were informed that all identifying information would remain confidential, and their permission to participate was obtained.

3.2.2 Listening Test

While the original study adopted a three-option question-response task, this study used a four-option picture-description task consisting of 15 test items from a TOEIC preparation book (Educational Testing Service, 2011). In this task, test-takers heard four statements (i.e., options) about a picture and were asked to select the one statement that best described the picture. The four statements were only heard, not read, and each was played only once. On the original CD attached to the preparation book, there was approximately a one-second pause between options and a three-second pause between items. We edited the CD to insert three-second pauses between options, for answering the questionnaire, and eight-second pauses between items, to allow participants ample time to answer both the listening test and the questionnaire. The listening test was played on a CD player.

3.2.3 Questionnaire

The design of the questionnaire made it possible to generate a confidence score for a given option and an attractiveness score for each distractor. To elicit test-takers' perceptions of keys and distractors, the current study adopted the original study's questionnaire, which served as both an answer sheet and a survey of confidence levels. Because the original questionnaire was designed for a three-option format, it was tailored to a four-option format. As seen in Figure 1, the questionnaire was symmetrically divided at the middle point (0: I'm not sure) into two sides for correctness (correct or incorrect), with three confidence levels on each side (H: very confident; M: moderately confident; L: not very confident).

[Figure 1 shows a sample questionnaire item: for each of Options A-D, a horizontal scale runs from Incorrect (very confident / moderately confident / not very confident) through "I'm not sure" to Correct (not very confident / moderately confident / very confident).]

Figure 1. Sample from the questionnaire.

Test-takers were asked to circle the part of the line corresponding to the confidence level they had in their judgment of each option's correctness. In addition, they were asked to circle one of the capital letters A, B, C, or D representing their chosen option. Figure 1 illustrates a case in which a participant judged A as incorrect with moderate confidence, B as incorrect with low confidence, C as correct with high confidence, and D as correct with low confidence; the test-taker therefore decided on C as his or her final answer. During the experiment, participants were instructed to indicate their degree of confidence in Option A immediately after hearing that option, and this step was repeated for Options B, C, and D. They were then asked to select and circle one of the four options as their final answer for the item.

3.2.4 Scoring

As described above, the questionnaire was used to elicit test-takers' perceptions of each option. As in the original study, keys and distractors were coded separately: the questionnaire data for keys were intended to elicit test-takers' confidence in the correct answer, while the data for distractors were intended to determine each distractor's level of attractiveness, indicated by test-takers' confidence in incorrect answers.

3.2.4.1 Confidence scoring

The questionnaire data were coded with values ranging from one to seven on the basis of the correctness of the response and test-takers' confidence levels. Table 1 illustrates how test-takers' confidence levels were converted into scores. Low and high scores indicate that participants felt very confident in an incorrect answer (e.g., one) or in a correct answer (e.g., seven), respectively.

Table 1
Scoring Table for Confidence Ratings on Keys and Attractiveness of Distractors

Confidence level       Correctness   Confidence/attractiveness score
Very confident         Correct       7
Moderately confident   Correct       6
Not very confident     Correct       5
I'm not sure           --            4
Not very confident     Incorrect     3
Moderately confident   Incorrect     2
Very confident         Incorrect     1
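To make the conversion concrete, the mapping in Table 1 can be expressed as a small scoring function. The sketch below is illustrative Python written for this edition, not part of the study's materials; all names are ours.

```python
# Hypothetical sketch of the Table 1 conversion; names are illustrative,
# not taken from the original study.

CONFIDENCE_LEVELS = ["not very confident", "moderately confident", "very confident"]

def option_score(judgment: str, confidence: str | None) -> int:
    """Convert a test-taker's judgment of one option into a 1-7 score.

    judgment   -- "correct", "incorrect", or "not sure" (the scale midpoint)
    confidence -- one of CONFIDENCE_LEVELS, or None when judgment is "not sure"
    """
    if judgment == "not sure":
        return 4  # scale midpoint: "I'm not sure"
    level = CONFIDENCE_LEVELS.index(confidence) + 1  # 1, 2, or 3
    # "Correct" judgments climb above the midpoint (5-7);
    # "incorrect" judgments descend below it (3-1).
    return 4 + level if judgment == "correct" else 4 - level

# For a key the score is read as confidence; for a distractor, as attractiveness.
assert option_score("correct", "very confident") == 7
assert option_score("not sure", None) == 4
assert option_score("incorrect", "very confident") == 1
```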

3.2.4.2 Attractiveness scoring

Distractor attractiveness was coded on the same one-to-seven scale, based on correctness and confidence; Table 1 also shows this conversion. Low and high scores indicate that a given distractor was deemed incorrect or correct, respectively, with a high level of confidence (although all distractors were in fact incorrect). A given distractor thus ranged from not attractive (e.g., one) to attractive (e.g., seven). An attractive distractor led the test-taker to deem it correct with high confidence, so a code of seven was assigned when a participant judged the distractor correct with high confidence. An unattractive distractor, on the other hand, only led test-takers to believe it to be incorrect, so a distractor judged incorrect with high confidence was coded one.

4. Results and Discussion

4.1 Listening Test Performance

The average score on the listening test in this replication study was 10.12 (out of 15, or 67% correct), while the mean in the original study was 9.15 (out of 15, or 61%). This difference can be interpreted in two ways: (a) participants in the replication study were more proficient than those in the original study, or (b) the picture-description task in the replication study was easier than the question-response task in the original study. The internal consistency reliability (Cronbach's alpha) of the replication study (α = .61) was lower than that of the original study (α = .70), probably because reliability is affected by the amount of variance among test-takers (Green, 2013): the participants' proficiency range in the replication study (SD = 2.30) was narrower than that in the original study (SD = 2.96).
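For reference, Cronbach's alpha can be computed directly from a persons-by-items score matrix with the standard formula α = k/(k − 1) × (1 − Σs²ᵢ / s²ₜ), where k is the number of items, s²ᵢ the variance of item i, and s²ₜ the variance of the total scores. A minimal sketch (ours, with simulated data, not the study's):

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (persons x items) score matrix."""
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Demo with random 0/1 data shaped like this test (68 test-takers, 15 items);
# the resulting value is meaningless for random data.
rng = np.random.default_rng(0)
demo = rng.integers(0, 2, size=(68, 15)).astype(float)
print(round(cronbach_alpha(demo), 2))
```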

Following the original study, participants were divided into three groups on the basis of the number of correct responses on the listening test, as Table 2 shows: a low-score group (LG), a middle-score group (MG), and a high-score group (HG).

Table 2
Three Groups on the Basis of the Listening Test

Group (n)   Score range   M       SD     95% CI           Reliability
LG (14)     5-8            6.71   1.07   [6.10, 7.33]
MG (34)     9-10           9.94   0.81   [9.66, 10.23]    .61
HG (20)     11-14         12.80   0.77   [12.44, 13.16]

Note. LG = low-score group; MG = middle-score group; HG = high-score group. Maximum possible score is 15. CI = confidence interval.
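The descriptive statistics in Table 2 can be reproduced from raw scores. The sketch below (ours) assumes the confidence intervals are t-based, which is consistent with the reported values (e.g., for the LG, 6.71 ± 2.16 × 1.07/√14 ≈ [6.09, 7.33]); the scores it generates are placeholders, not the study's data.

```python
import numpy as np
from scipy import stats

def group_summary(scores: np.ndarray) -> tuple[float, float, tuple[float, float]]:
    """Mean, SD, and t-based 95% CI for one score group (as in Table 2)."""
    n = len(scores)
    m = scores.mean()
    sd = scores.std(ddof=1)
    half = stats.t.ppf(0.975, n - 1) * sd / np.sqrt(n)  # t critical value * SE
    return m, sd, (m - half, m + half)

# Placeholder scores for the low-score group (n = 14, range 5-8).
rng = np.random.default_rng(4)
lg_scores = rng.integers(5, 9, size=14).astype(float)
m, sd, ci = group_summary(lg_scores)
print(f"M = {m:.2f}, SD = {sd:.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```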

4.2 Confidence and Attractiveness

4.2.1 Reliability of the Questionnaire

Internal consistency reliability (Cronbach's alpha) of the questionnaire was calculated separately for (a) confidence (i.e., keys) and (b) attractiveness (i.e., distractors). As Table 3 shows, reliability for confidence (α = .67) was lower than for attractiveness (α = .91), probably because the number of keys (15, one per item) was smaller than the number of distractors (45, three per item). The same pattern was found in the original study (confidence α = .80; attractiveness α = .92).

Table 3
Reliability of the Questionnaire Used in This Study

                   Confidence   Attractiveness
No. of items       15           45
Cronbach's alpha   .67          .91

4.2.2 Confidence in keys

Table 4 summarizes the questionnaire results. As can be seen, the mean confidence score for the keys increased as listening level (i.e., score group) increased, whereas the mean attractiveness score for the distractors decreased as listening level increased. A one-way ANOVA performed on the confidence scores indicated that the groups differed significantly, with a large effect size: F(2, 65) = 31.91, p < .001, η² = .50. Tukey's post-hoc tests revealed that all groups differed significantly from each other in confidence scores, p < .05. These results parallel the findings of the original study, where the three groups differed significantly with a large effect size, F(2, 72) = 53.22, p < .001, η² = .36, and the post-hoc test showed that they differed from each other at the .05 level.

Table 4
Average Test-Takers' Confidence in Keys and Attractiveness of Distractors

                 Low              Middle           High
Score group      M       SD       M       SD       M        SD
Confidence       4.55a   0.39     4.94a   0.45     5.65a    0.37
Attractiveness   3.19a   0.58     2.86b   0.57     2.31a,b  0.47

Note. Means in a row sharing subscripts (a & b) are significantly different from each other. Maximum possible score is seven.

These results suggest that advanced test-takers tended to have more confidence than less-advanced test-takers when they responded with correct answers. This may be related to the effective use of metacognitive strategies; that is, proficient listeners can monitor and evaluate their listening comprehension more efficiently than less-proficient listeners (e.g., Macaro, Graham, & Vanderplank, 2007; Vandergrift & Goh, 2012). It can therefore be concluded that advanced test-takers choose the correct answers with more confidence.
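For readers wishing to reproduce this kind of analysis, the sketch below runs a one-way ANOVA with eta squared and Tukey's HSD in Python using scipy and statsmodels (our choice of tools; the article does not report its software). The group data are simulated from the Table 4 means and SDs, not the study's raw scores, and η² is computed as SS_between / SS_total.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Placeholder confidence scores simulated from Table 4 (not the raw data).
rng = np.random.default_rng(1)
lg = rng.normal(4.55, 0.39, 14)   # low-score group
mg = rng.normal(4.94, 0.45, 34)   # middle-score group
hg = rng.normal(5.65, 0.37, 20)   # high-score group

f, p = stats.f_oneway(lg, mg, hg)

# Eta squared = SS_between / SS_total.
groups = [lg, mg, hg]
pooled = np.concatenate(groups)
grand = pooled.mean()
ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
ss_total = ((pooled - grand) ** 2).sum()
print(f"F(2, 65) = {f:.2f}, p = {p:.3f}, eta^2 = {ss_between / ss_total:.2f}")

# Tukey's HSD post-hoc comparisons between all group pairs.
labels = ["LG"] * 14 + ["MG"] * 34 + ["HG"] * 20
print(pairwise_tukeyhsd(pooled, labels))
```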

4.2.3 Attractiveness of distractors

Another one-way ANOVA, performed on the attractiveness scores, again indicated that the groups differed significantly with a large effect size: F(2, 65) = 12.01, p < .001, η² = .27. Tukey's post-hoc tests revealed that the scores of LG and MG did not differ from each other, but that both were significantly higher than the scores of HG, p < .05. These results were consistent with the original study, where the three groups differed significantly with a large effect size, F(2, 72) = 29.85, p < .001, η² = .21, and the post-hoc test demonstrated that the attractiveness of distractors for LG and MG was significantly higher than for HG at the .05 level. These results indicate that lower-level test-takers tended to be more tempted by distractors than advanced test-takers.

4.2.4 Response frequency and discrimination in distractors

Table 5 reports the response frequency (i.e., how many test-takers chose each option) and the results of significance tests for each of the 15 items. In Item 1, for example, 45.6% of the participants (n = 31) chose the key, 39.7% (n = 27) chose Distractor 1, 11.8% (n = 8) selected Distractor 2, and the rest (n = 2) chose Distractor 3. Fisher's exact test*2 was carried out to determine whether the number of responses differed significantly between the four options; for Item 1 the difference was significant, χ²(3) = 35.41, p < .001. Overall, in all 15 items there was a significant difference in response frequencies between the options.

Table 5 also shows the discriminatory power (point-biserial correlation coefficient, rpbi) of each option. According to Haladyna and Rodriguez (2013), distractors should be negatively correlated with the total score, and keys should be positively correlated with it. As the table shows, in all 15 items the correlation for the key was higher than for the distractors (Distractors 1, 2, & 3). Moreover, all the distractors produced negative or almost-zero point-biserial correlations. Therefore, in line with the original study, all distractors in the replication study can be considered to have functioned properly in terms of discrimination.
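Both statistics in Table 5 are straightforward to compute. The sketch below (ours; variable names are illustrative) computes a point-biserial correlation for a simulated option, and a goodness-of-fit chi-square over Item 1's observed counts from Table 5, which reproduces the reported χ²(3) = 35.41. Note that the article reports Fisher's exact tests, whereas scipy's fisher_exact covers only 2 x 2 tables, so the chi-square equivalent is shown here.

```python
import numpy as np
from scipy import stats

# Placeholder data for one option (not the study's): 1 = participant chose it.
rng = np.random.default_rng(2)
total_scores = rng.integers(0, 16, size=68)      # listening totals (0-15)
chose_option = rng.integers(0, 2, size=68)       # choice indicator

# Discriminatory power: point-biserial correlation between choosing an
# option and the total test score (keys should come out positive,
# distractors negative or near zero).
r_pbi, p = stats.pointbiserialr(chose_option, total_scores)
print(f"r_pbi = {r_pbi:.2f} (p = {p:.3f})")

# Response-frequency check: goodness-of-fit chi-square against equal
# choice of the four options, using Item 1's observed counts from Table 5.
observed = np.array([31, 27, 8, 2])  # key, distractor 1, 2, 3
chi2, p = stats.chisquare(observed)
print(f"chi^2({len(observed) - 1}) = {chi2:.2f}, p = {p:.4f}")  # 35.41
```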

Table 5
Response Frequency, Fisher's Exact Tests, and Discriminatory Power for Each Item

        Key/IF       Distractor 1   Distractor 2   Distractor 3
Item    % (rpbi)     % (rpbi)       % (rpbi)       % (rpbi)       df   χ² value
1       45.6 (.39)   39.7 (.25)     11.8 (.08)      2.9 (.26)     3     35.41***
2       45.6 (.62)   29.4 (.27)     20.6 (.43)      4.4 (.05)     3     24.12***
3       50.0 (.54)   26.5 (.16)     22.1 (.43)      1.5 (.16)     3     32.35***
4       86.8 (.33)    8.8 (.23)      4.4 (.23)      0             2     87.56***
5       95.6 (.34)    4.4 (.34)      0               0            1     56.53***
6       52.9 (.57)   23.5 (.41)     20.6 (.26)      2.9 (.02)     3     35.06***
7       52.9 (.32)   17.6 (.48)     16.2 (.04)     13.2 (.02)     3     28.59***
8       25.0 (.30)   54.4 (.22)     20.6 (.05)      0             2     13.79**
9       94.0 (.38)    3.0 (.20)      3.0 (.33)      0             2    113.06***
10      97.1 (.23)    1.5 (.06)      1.5 (.26)      0             2    124.27***
11      92.6 (.19)    5.9 (.10)      1.5 (.21)      0             2    107.85***
12      91.2 (.30)    2.9 (.12)      2.9 (.26)      2.9 (.12)     3    158.82***
13      85.3 (.44)   13.2 (.41)      1.5 (.16)      0             2     84.03***
14      75.0 (.38)   23.5 (.33)      1.5 (.21)      0             2     58.09***
15      23.9 (.37)   38.8 (.31)     25.4 (.04)     11.9 (.07)     3      8.59*

Note. IF = item facility. * p < .05, ** p < .01, *** p < .001.

4.2.5 The lowest attractiveness and response frequency

Table 6 presents the results of the 3 x 3 (Group x Distractor) two-way mixed ANOVAs carried out on the attractiveness scores for each of the 15 MC items. The levels of the Distractor variable were based on response frequency, such that Dis 1 was the most frequently chosen distractor, Dis 2 the next, and Dis 3 the least chosen. The alpha level was set to .003 (.05/15) because 15 significance tests were conducted, one per item. Post-hoc analysis was performed with one-way ANOVAs followed by t-tests with Bonferroni correction.
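As a sketch of this design, the code below runs a 3 x 3 mixed ANOVA (Group between-subjects, Distractor within-subjects) on simulated long-format data using the pingouin library. The tool choice and the data are ours, not the article's; pingouin reports partial eta squared (np2) rather than the η² in Table 6.

```python
import numpy as np
import pandas as pd
import pingouin as pg

# Placeholder long-format data (not the study's): one attractiveness score
# per participant per distractor for a single item.
rng = np.random.default_rng(3)
rows = []
for pid, group in enumerate(["LG"] * 14 + ["MG"] * 34 + ["HG"] * 20):
    for dis in ["Dis1", "Dis2", "Dis3"]:
        rows.append({"id": pid, "group": group, "distractor": dis,
                     "attractiveness": rng.normal(3.0, 1.0)})
df = pd.DataFrame(rows)

# Mixed ANOVA: Group is between-subjects, Distractor is within-subjects.
aov = pg.mixed_anova(data=df, dv="attractiveness", within="distractor",
                     subject="id", between="group")
print(aov[["Source", "F", "p-unc", "np2"]])

# Bonferroni-corrected alpha for 15 item-level tests, as in the study.
print("alpha =", round(0.05 / 15, 3))  # .003
```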

The results of the ANOVAs fall into five categories according to which effects (Group x Distractor) reached statistical significance: (a) interaction (Item 5); (b) Group (Items 2 & 11); (c) Distractor (Items 1, 3, 4, 6, 8, & 15); (d) Group and Distractor (Items 7, 12, 13, & 14); and (e) no difference (Items 9 & 10). Overall, in two-thirds of the items the three distractors differed significantly from each other in attractiveness, and in one-third of the items the level of distractor attractiveness varied across groups. This categorization based on the ANOVA results alone, however, provides only a partial picture of distractor attractiveness. It is prudent to also take into account the response frequencies shown in Table 5, cross-checking the rank of attractiveness against the rank of response frequency for the distractors of each item.

Table 6
Two-Way ANOVAs on Attractiveness of Three Distractors for 15 Items

Item   Source               MS       F       p        η²     Post hoc
1      Group                 6.69    2.23    .116     0.02
       Distractor           92.80   42.86    .001*    0.27   Dis 3 < 2 < 1
       Group x Distractor    1.44    0.67    .617     0.01
2      Group                26.40    7.94   <.001*    0.06   H < M = L
       Distractor           10.46    2.84    .062     0.03
       Group x Distractor   12.95    3.52    .009     0.06
3      Group                 2.83    1.32    .275     0.01
       Distractor          108.85   52.74   <.001*    0.34   Dis 3 < 1 = 2
       Group x Distractor    1.08    0.52    .719     0.01
4      Group                 1.90    0.56    .573     0.01
       Distractor           35.64   13.58   <.001*    0.11   Dis 3 < 2 = 1
       Group x Distractor    1.05    0.40    .810     0.01
5      Group                 4.75    1.84    .167            L: Dis 2 = 3 < 1
       Distractor           38.88   30.93   <.001*           M: Dis 2 < 1 = 3
       Group x Distractor    6.97    5.55   <.001*    0.06   H: Dis 2 = 1 = 3
6      Group                13.89    4.25    .018     0.05
       Distractor           34.61   21.85   <.001*    0.13   Dis 3 < 2 = 1
       Group x Distractor    1.12    0.71    .587     0.01
7      Group                21.37    9.05   <.001*    0.10   H < M = L
       Distractor           10.41    7.07    .001*    0.05   Dis 1 = 3 < 2
       Group x Distractor    4.46    3.03    .020     0.04
8      Group                 4.59    2.71    .074     0.02
       Distractor          139.64   99.22   <.001*    0.48   Dis 3 < 2 < 1
       Group x Distractor   10.30    0.73    .573     0.01
9      Group                 8.09    2.55    .086     0.04
       Distractor            5.59    4.21    .017     0.03
       Group x Distractor    0.70    0.53    .716     0.01
10     Group                 6.37    2.46    .094     0.05
       Distractor            3.36    5.45    .005     0.02
       Group x Distractor    0.87    1.41    .233     0.01
11     Group                14.67    7.00    .002*    0.08   H < M = L
       Distractor            7.38    5.37    .006     0.04
       Group x Distractor    2.51    1.83    .127     0.03
12     Group                22.68    9.70   <.001*    0.12   H < M = L
       Distractor            9.81    8.85   <.001*    0.05   Dis 3 = 2 < 1
       Group x Distractor    1.00    0.91    .463     0.01
13     Group                35.31   19.09   <.001*    0.14   H < M = L
       Distractor           68.20   55.19   <.001*    0.28   Dis 2 = 3 < 1
       Group x Distractor    0.10    0.08    .989     0.00
14     Group                18.50   10.31   <.001*    0.08   H < M = L
       Distractor           72.57   63.23   <.001*    0.32   Dis 3 = 2 < 1
       Group x Distractor    0.90    0.78    .539     0.01
15     Group                 7.50    3.41    .039     0.03
       Distractor           18.27   10.30   <.001*    0.08   Dis 3 = 2 < 1
       Group x Distractor    1.76    0.99    .413     0.02

Note. Dis 1 = the most frequently chosen distractor; Dis 2 = the second most frequently chosen distractor; Dis 3 = the least frequently chosen distractor. H = high-score group; M = middle-score group; L = low-score group. * p < .003.

Table 7 presents four categories that combine the ANOVA results with the response frequency results. Note that in Table 7, items are separated according to whether the main effect of Group was statistically significant. Although this information is not crucial for identifying the relationship between the least attractive and the least frequently selected distractors, it reveals a tendency in test-takers' overall perceptions of the distractors.

Table 7
Items Categorized on the Basis of the Relationship Between Attractiveness and Response Frequency

                                                             Main effect of Group
Category: The least frequently chosen distractor is ...      ns               sig
1. the least attractive                                      1, 3, 4, 6, 8
2. as attractive as the second most frequently chosen one    15               2, 12, 13, 14
3. as attractive as the most frequently chosen one           9, 10            7, 11
Others                                                       5

The first category, "the least frequently chosen distractor is the least attractive," was assigned when the main effect of Distractor was significant and the least attractive distractor was also the least frequently selected one. This category reflects the conventional assumption that a distractor's attractiveness is commensurate with its response frequency. Only five of the 15 items fell into this category (Items 1, 3, 4, 6, & 8).

The second category, "the least frequently chosen distractor is as attractive as the second most frequently chosen one," was assigned either (a) when there was no significant main effect of Distractor, or (b) when the main effect of Distractor was significant but the second most frequently chosen distractor was no more attractive than the least chosen one. Five of the 15 items fell into this category (Items 2, 12, 13, 14, & 15).

The third category, "the least frequently chosen distractor is as attractive as the most frequently chosen one," was assigned either (a) when no significant main effect of Distractor was found, or (b) when the main effect of Distractor was significant but the most frequently chosen distractor was no more attractive than the least frequently chosen one. Four of the 15 items fell into this category (Items 7, 9, 10, & 11).

The last category (Others) covered the one item with a significant interaction between Group and Distractor (Item 5). More specifically, for the LG, the least frequently chosen distractors (Dis 2 & 3) were the least attractive; for the MG and HG, on the other hand, the least frequently chosen distractors were not the least attractive.

In summary, these results show that in one-third of the items the least frequently selected distractor was the least attractive, but that for more than half of the items the least frequently chosen distractor was as attractive as the second most frequently chosen distractor, or even as the most frequently chosen one. One might argue that some items in this study attracted test-takers too easily. Indeed, the item facility of several items (Items 5, 9, 10, 11, & 12) exceeded 90%, and the mean attractiveness scores of the distractors in those items were relatively low. Not all the easy items, however, failed to attract test-takers. In Item 11, for instance, the mean distractor attractiveness was not low (M = 2.3 for HG, M = 3.0 for MG, and M = 3.4 for LG). Hence, distractors in even easy items did attract test-takers to some extent. This finding accords well with the original study: there were only five out of 15 items in which the least frequently chosen distractor was also the least attractive, whereas in seven out of 15 items the least frequently chosen distractor was not the least attractive.

These findings lead us to reconsider the conventional way of evaluating distractors through response frequency. As found in this replication study as well as the original study, several unselected distractors were still able to function properly. It is necessary to differentiate distractors that were neither chosen nor attractive from those that were not chosen but were plausible enough to attract test-takers. Response frequency alone is therefore insufficient for evaluating the effectiveness of distractors; instead, tools that can elicit test-takers' perceptions of each distractor are required. The questionnaire developed by Iimura (2014) can be recommended for this purpose because it has shown strong internal consistency in both the original and this replication study. That said, one might ask how such tools could be incorporated into actual test design. In high-stakes test development, new items are usually checked through pilot or field testing; the attractiveness of each distractor should be examined in such trialing phases using the abovementioned questionnaire.

5. Conclusion

The functionality of distractors has predominantly been evaluated by response frequency; that is, prior investigations have assumed a direct relationship between response frequency and the attractiveness of distractors. This study reexamined that relationship using a questionnaire that revealed test-takers' confidence in each option. The findings strongly support Iimura (2014) in that (a) upper-level test-takers had greater confidence when responding with the correct answers, (b) lower-level test-takers were more attracted to distractors, and (c) the least frequently chosen distractors were not necessarily the least attractive. It can therefore be surmised that conventional item analyses are insufficient for evaluating the effectiveness of distractors during test development, and that a new kind of survey, in which test-takers evaluate each distractor independently, should be incorporated into MC test development.

This study examined distractor attractiveness with a four-option picture-description listening task, whereas the previous study had examined it with a three-option question-response task. Both tasks, however, are relatively simple compared to passage comprehension tasks, which contain more complex information in the passage itself, the question stems, and the options. Future research should therefore include passage comprehension tasks with longer dialogues or monologues.

Acknowledgements

This work was supported by JSPS (Japan Society for the Promotion of Science) KAKENHI (Grant-in-Aid for Scientific Research) Grant Number 15K02790.

Notes

1. English listening proficiency levels from A2.1 to B1.2 on the CEFR-J are equivalent to TOEIC listening scores from 110 (185) to 335 (395) (Tono, 2013, p. 229).
2. Fisher's exact tests were used instead of chi-square tests because they are more suitable when expected frequencies are low (e.g., when the sample size is quite small; Field, 2009).

References

Bachman, L., & Palmer, A. (2010). Language assessment in practice. Oxford: Oxford University Press.
Douglas, D. (2010). Understanding language testing. London: Hodder Education.
Downing, S. M. (2006). Selected-response item formats in test development. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 287-301). Mahwah, NJ: Lawrence Erlbaum Associates.
Educational Testing Service. (2011). TOEIC test official practice: Listening. Tokyo: Institute for International Business Communication.
Field, A. (2009). Discovering statistics using SPSS (3rd ed.). London: Sage.
Field, J. (2012). Cognitive validity. In A. Geranpayeh & L. Taylor (Eds.), Examining listening: Research and practice in assessing second language listening (pp. 77-241). Cambridge: Cambridge University Press.
Fulcher, G. (2010). Practical language testing. London: Hodder Education.
Green, R. (2013). Statistical analyses for language testers. Hampshire: Palgrave Macmillan.
Haladyna, T. M. (2004). Developing and validating multiple-choice test items (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Haladyna, T. M., & Downing, S. M. (1993). How many options is enough for a multiple-choice test item? Educational and Psychological Measurement, 53, 999-1010. doi:10.1177/0013164493053004013
Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. New York: Routledge.

Henning, G. (1987). A guide to language testing: Development, evaluation, research. Cambridge, MA: Newbury House.
Hughes, A. (2003). Testing for language teachers (2nd ed.). Cambridge: Cambridge University Press.
Iimura, H. (2014). Attractiveness of distractors in multiple-choice listening tests. JLTA Journal, 17, 19-39.
Lee, H., & Winke, P. (2013). The differences among three-, four-, and five-option-item formats in the context of a high-stakes English-language listening test. Language Testing, 30, 99-123. doi:10.1177/0265532212451235
Macaro, E., Graham, S., & Vanderplank, R. (2007). A review of listening strategies: Focus on sources of knowledge and on success. In A. D. Cohen & E. Macaro (Eds.), Language learner strategies (pp. 165-185). Oxford: Oxford University Press.
Pashler, H., & Johnston, J. C. (1998). Attentional limitations in dual-task performance. In H. Pashler (Ed.), Attention (pp. 155-189). Hove, East Sussex: Psychology Press.
Rodriguez, M. C. (2005). Three options are optimal for multiple-choice items: A meta-analysis of 80 years of research. Educational Measurement: Issues and Practice, 24(2), 3-13. doi:10.1111/j.1745-3992.2005.00006.x
Shizuka, T., Takeuchi, O., Yashima, T., & Yoshizawa, K. (2006). A comparison of three- and four-option English tests for university entrance selection purposes in Japan. Language Testing, 23, 35-57. doi:10.1191/0265532206lt319oa
Tono, Y. (Ed.). (2013). The CEFR-J handbook: A resource book for using CAN-DO descriptors for English language teaching. Tokyo: Taishukan.
Vandergrift, L., & Goh, C. C. M. (2012). Teaching and learning second language listening: Metacognition in action. New York, NY: Routledge.