University of Rochester Course Evaluation Project. Ronald D. Rogge. Associate Professor. Ista Zahn. Doctoral Candidate

University of Rochester Course Evaluation Project. Ronald D. Rogge, Associate Professor. Ista Zahn, Doctoral Candidate. Department of Clinical & Social Sciences in Psychology.

Project Impetus

The current online administration of course evaluation surveys at the University of Rochester has led to notably lower response rates, dropping from the 80-90% rates typically observed with in-class administration to 30-40%, despite efforts by the administration to provide incentives to promote higher participation. Such a marked drop in participation could effectively render the information obtained by those online course evaluations less reliable and/or excessively biased. To address this issue, we conducted a study to directly compare the quality of information obtained from online vs. in-class administration of course evaluations. However, to thoroughly examine that more focused goal, it was advantageous to examine a broader set of 5 inter-related goals.

Project Aims

Goal 1: Examining the Current Questions. The project examined the questions with numeric responses currently being used in the University of Rochester's course evaluation to determine how much unique information they offer instructors.

Goal 2: Evaluating Bias. The project sought to determine the degree to which known biases influence responses on each of the course evaluation questions being used at the University of Rochester.

Goal 3: Developing a New Tool. The project sought to develop a new evaluation tool that could offer instructors more useful and diverse feedback with markedly lower levels of bias.

Goal 4: Contrasting Online to In-Class Administration. The project sought to determine if the lower response rates seen with online administration might lead to unreliable or excessively biased data.

Goal 5: Determining Number of Responses Needed. The project sought to determine the minimum number of student responses necessary to obtain reasonably accurate and reliable estimates of course ratings for an individual course.

Project Method

Overall Design. The project collected course evaluation data from 1,519 students from 48 courses across 20 departments in the Spring 2010 semester. Students completed the course evaluation questions currently in use at the University of Rochester as well as a set of 80 additional items that included: 1) a diverse pool of course evaluation items currently in use at other universities around the globe, and 2) items assessing potential sources of bias in course ratings.

Selecting Courses. The courses were specifically selected to represent the diversity of courses offered at the University of Rochester. Courses were selected in pairs from each department (matching on course size and course level) and then courses within each pair were randomly assigned to administer course evaluations in-class or via the current online system. This helped to ensure that online vs. in-class comparisons would be made on comparable sets of courses. The project was approved by the Research Subjects Review Board at the University of Rochester, and all students and instructors involved were informed of their rights prior to consenting to the study. We achieved a response rate of 67% using in-class administration and a response rate of 40% using online administration, mirroring the difference in response rates seen at the university level since instituting the online administration system.

Biases Influencing Responses. After a thorough review of the literature in this area and some preliminary analyses, we focused the project on the following set of biases: student gender, student personality traits (extraversion, agreeableness, conscientiousness, openness, neuroticism), expected grade, GPA, prior interest in the course, perceived course difficulty, and perceived instructor attractiveness. These were assessed with standard one- or two-item measures.

Project Results

Goal 1: Examining the Current Questions. The project used exploratory factor analyses (EFA)[1] on the questions (with numeric responses) currently being used in the University of Rochester's course evaluation to determine how many dimensions of information (factors) they represent. The EFA analyses on the items currently in use at the University of Rochester suggested that they were essentially measuring four distinct constructs:

Overall Quality: a set of 5 strongly correlated items assessing global ratings of course quality. (What overall rating would you give this course? What overall rating would you give this instructor? How effective was the instructor's teaching in this course? Rate the increase of your knowledge or skills from this course. I have a stronger interest in this subject because of the instructor.)

Student Effort: 2 strongly correlated items assessing students' reports of the effort they expended. (Rate the level of your involvement in the activities of this course (for example: attendance, participation, completing assignments). What overall rating would you give yourself as a student in this course?)

Quality of Readings: 1 item assessing the quality of the course readings. (The readings were important in my learning in this course.)

Quality of Syllabus: 1 item assessing the quality of the course syllabus. (How well did the syllabus describe the course content?)

Based on these EFA results, we created composite scores representing overall quality and student effort so that the remaining analyses could examine the factors identified.

[1] This technique examines sets of items to help researchers identify subsets of items that seem to be measuring the same underlying construct (the same factor). As a result, EFA enables researchers to take a larger set of items and simplify them down into a handful of dimensions represented by those items. After identifying the subsets of items that make up each dimension, researchers then examine the item content within each of those dimensions to come up with descriptive labels to represent what each set of items seems to be measuring.
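For readers unfamiliar with the technique, the intuition behind EFA can be sketched in a few lines. The example below is purely illustrative (simulated data, not the project's responses): it builds 5 items that share one latent construct and 2 items that share another, then counts how many factors have eigenvalues above 1 (the Kaiser criterion, a common first step in exploratory factor analysis).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
quality = rng.normal(size=n)   # latent construct 1 (e.g., overall quality)
effort = rng.normal(size=n)    # latent construct 2 (e.g., student effort)

# 5 hypothetical items loading on construct 1 and 2 items on construct 2;
# each item = loading * latent score + unique noise
items = np.column_stack(
    [0.8 * quality + 0.6 * rng.normal(size=n) for _ in range(5)] +
    [0.8 * effort + 0.6 * rng.normal(size=n) for _ in range(2)]
)

R = np.corrcoef(items, rowvar=False)       # 7x7 item correlation matrix
eigvals = np.linalg.eigvalsh(R)[::-1]      # eigenvalues, largest first
n_factors = int(np.sum(eigvals > 1.0))     # Kaiser criterion
print(n_factors)  # 2 factors, matching the two simulated constructs
```

With real survey data, a dedicated EFA routine (with factor rotation) would be used rather than this bare eigenvalue count, but the logic of recovering a small number of dimensions from a larger item pool is the same.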

In addition, although instructors are given feedback on student responses to all questions on the current evaluation form, there are two questions (What overall rating would you give this course? What overall rating would you give this instructor?) that are being used to represent course quality in faculty activity reports. Given the emphasis placed on these two single items, we examined the quality of information offered by these individual items in the remaining analyses as well.

Goal 2: Evaluating Bias. To determine the degree to which student responses were unduly influenced by factors other than course quality, we ran analyses[2] in which we allowed a comprehensive set of biases to predict student responses on each of the items of the current evaluation form. After a thorough review of the literature and some preliminary analyses, we focused the project on the following biases:

Student gender
Student personality traits (extraversion, agreeableness, emotional stability, conscientiousness, openness)
Student's expected grade
Student GPA
Student's prior interest in the course
Student perceptions of course difficulty
Student perceptions of instructor attractiveness

We also examined a number of additional sources of bias:

Instructor gender
Instructor command of spoken English
Instructor status (tenure-track vs. non-tenure-track)
Course subject area (engineering vs. natural science vs. social science)
Course size
Course status (required for major vs. elective)
Course level (100 vs. 200 or higher)
Class time of day (morning vs. midday vs. afternoon)
Class format (lectures vs. discussions vs. mixed)
Student year

However, after controlling for the biases listed first, these remaining biases failed to demonstrate any influence on student responses to course evaluation items, and so, in the interest of parsimony, they were dropped from the final analyses.

[2] We built 2-level HLM regression models (appropriate to the nested nature of the data) in which individual students were modeled at level 1 and courses were modeled at level 2. Sources of bias served as the predictor variables, and the 2-level model allowed us to enter those predictors at both levels. For example, a predictor like instructor attractiveness could be entered at level 1 as individual students' perceptions influencing their own ratings of course quality (explaining differences in ratings between students in the same course). In addition, instructor attractiveness could be entered at level 2, allowing average ratings of attractiveness within each course to explain differences in ratings of course quality between courses. Each item on the course evaluation survey then served as the outcome variable in a separate analysis.

Although this project is truly unique in that it will ultimately be the first study in the published literature to examine such a comprehensive set of biases, there were additional sources of bias that could not be included. For example, it would have been interesting to examine possible biases associated with instructor race or more course-specific biases such as students' hostility toward race-related courses. However, given the broader goals of the project and the practical limits of what could be accomplished in a single semester, we simply were not able to examine all possible sources of bias in this project.

After identifying the final set of biases to be examined, we then determined exactly how strongly those biases influenced student responses for each of the items and dimensions identified above. The results suggested that the current evaluation questions in use at the University of Rochester were strongly influenced by these sources of bias, with estimates suggesting that 25-50% of the variability in responses could be explained by bias (see graph). Among the sources of bias, three emerged as the strongest and most consistent predictors of student responses: higher expected grades, prior interest in the course, and perceptions of the instructor's attractiveness[3] were associated with higher ratings on all of the measures tested. Taken as a set, these results indicated that 25-50% of the differences in course quality ratings across the 48 courses of the study could be explained by factors like instructor attractiveness and prior interest in the course. This would mean that the average overall course or instructor ratings for a specific course could easily seem lower than another course not because there was a true difference in course quality but only because students happened to be less interested in that course prior to enrolling or because the students found that instructor to be less attractive. Estimates of 25-50% influence by these factors therefore suggest highly problematic levels of bias in the evaluation questions currently in use at the University of Rochester.

[3] We assessed perceptions of instructor attractiveness by having each student answer two questions: Did you think this instructor was physically attractive? and Do you think other students would find this instructor physically attractive? We chose these questions as they were simple and straightforward, hopefully minimizing differences in interpretation across students. We also put these questions near the very end of the survey so that responses to them would have the lowest chance of influencing any of the other questions on the survey.
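The 2-level modeling idea described above can be approximated without specialized multilevel software. The sketch below is a simplified stand-in (simulated data, ordinary least squares rather than a true HLM): group-mean centering lets the same attractiveness predictor enter once at the course level (the course mean) and once at the student level (each student's deviation from that mean).

```python
import numpy as np

rng = np.random.default_rng(1)
n_courses, n_students = 48, 30
course = np.repeat(np.arange(n_courses), n_students)

# simulated attractiveness ratings: a course-level component shared by all
# students in a course, plus each student's individual perception
course_attr = rng.normal(size=n_courses)[course]
attract = course_attr + rng.normal(size=course.size)

# simulated course rating influenced by attractiveness at both levels
# (true between-course effect 0.5, true within-course effect 0.3)
rating = (0.5 * course_attr + 0.3 * (attract - course_attr)
          + rng.normal(size=course.size))

# level-2 predictor: course mean; level-1 predictor: deviation from that mean
course_mean = np.array(
    [attract[course == c].mean() for c in range(n_courses)])[course]
deviation = attract - course_mean

X = np.column_stack([np.ones_like(rating), course_mean, deviation])
b0, b_between, b_within = np.linalg.lstsq(X, rating, rcond=None)[0]
print(round(b_between, 2), round(b_within, 2))  # near the simulated 0.5, 0.3
```

A true HLM (e.g., via lme4 in R or statsmodels' MixedLM in Python) would additionally model course-level random effects; this fixed-effects sketch only illustrates how one predictor can carry information at two levels of the nesting.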

Goal 3: Developing a New Tool. Given the high levels of bias associated with the current evaluation questions in use at the University of Rochester, the project sought to develop a new evaluation tool that could offer instructors more useful feedback with markedly lower levels of bias. Drawing a diverse and representative set of items from the course evaluation measures used at over a dozen universities across the globe, the project examined that item pool to: 1) estimate the degree to which responses on each item were influenced by biases, 2) identify the unique dimensions of information assessed by those items, and 3) combine that information to create a new set of questions that would offer instructors the most diverse information with the least possible amount of bias.

Global Questions Assessing Overall Quality. These analyses revealed that there were roughly 5 distinct sets of items in the item pool. One of the dimensions assessed overall course quality, containing the 5 items in use at the University of Rochester as well as similar items in use at other universities. However, the analyses also indicated that all of those globally worded items (e.g., What overall rating would you give this course? Rate the increase of your knowledge or skills from this course) tended to be strongly influenced by sources of bias like instructor attractiveness. This is consistent with a large body of literature on what psychologists would term global subjective evaluations. This body of research suggests that when individuals are asked to make global ratings (integrating different pieces of information into an overall score), they tend to do a poor job of integrating information and instead rely on an overall gut feeling (or what we would call global sentiment), causing such overall ratings to be more strongly influenced by things like the likability or attractiveness of the individual being rated.

New Dimensions. In addition to identifying that set of global items, however, the analyses revealed 4 new dimensions of information being assessed at other universities but not at the University of Rochester. The items within these 4 new dimensions tended to ask about more specific student experiences in the course (e.g., the instructor used examples in lecture that really helped me understand the material), and our analyses suggested that responses to those questions were far less influenced by sources of bias. More importantly, these 4 new dimensions could offer instructors richer and more diverse feedback on different aspects of their courses that might influence student learning and course quality. This is feedback currently unavailable to University of Rochester instructors, as most of these dimensions are not currently represented in our course evaluations.

For example, one dimension asks students specific questions about pedagogical skills: assessing the instructor's abilities to summarize material, use helpful examples, fit specific topics into a larger whole, and adjust the pace of lectures. Not only would average ratings on each of these 4 questions offer useful information to instructors to hone their own teaching skills, but our analyses suggest that these 4 questions are measuring a common construct and could therefore be averaged to create a teaching skills composite reflecting an instructor's overall pedagogical skills. A second dimension assesses the quality of course materials: assessing the quality and clarity of exams and assignments. Not only would ratings on each of these questions provide useful information to instructors, but our analyses suggest that such questions could be averaged to create a quality of exams & assignments composite to reflect the overall quality of an instructor's course materials. The third dimension asks students questions about the instructor's ability to connect with students: assessing the degree to which the instructor conveyed respect for students, a willingness to listen, and the degree to which the instructor made him- or herself available to students. As with the previous two dimensions, not only would ratings on each of these questions provide useful information to instructors, but these questions could be averaged to create a rapport with students composite reflecting an instructor's overall ability to connect with students in his or her class. The final dimension asks students questions about the instructor's organization: assessing the degree to which the lectures were organized, the instructor was able to stay on topic during lectures, and the lectures were well prepared. Once again, not only would ratings on each of these questions provide useful information to instructors, but these questions could be averaged to create an instructor organization composite reflecting an instructor's overall level of organization.

Our analyses in this project ultimately identified 13 items that were able to assess these 4 different aspects of course quality with low levels of bias:

1. Teaching skills
a. Instructor used examples in lecture / class discussions that really helped me understand the material
b. The instructor's way of summarizing or emphasizing important points was effective
c. As the course progressed, the instructor showed me how each of the topics fit into a whole
d. Instructor noticed when students did not understand and adjusted the lecture pace accordingly

2. Quality of Exams & Assignments
a. The exam questions were clearly worded
b. The exams covered important aspects of the course
c. The assignments were helpful in understanding the material

3. Rapport with Students
a. The instructor demonstrated sincere respect for students
b. The instructor was willing to listen to student questions and/or opinions
c. The instructor made him- or herself available for extra help

4. Organization
a. Lectures / class discussions were disorganized (reverse scored)
b. Instructor frequently got off topic during lectures (reverse scored)
c. The instructor tended to be ill prepared (reverse scored)

As seen in the graph quantifying bias above, responses to these sets of questions are notably less influenced by biases than the items currently being used at the University of Rochester. Thus, our results suggest that using these 13 items to assess various aspects of course quality would not only give instructors more useful feedback, but the feedback would be less influenced by factors like prior interest in the course and instructor attractiveness. Although the items asking for global impressions tended to be excessively influenced by bias, we recognize the utility of being able to have a single global score to reflect course quality.

Toward this end, we used multiple regression analyses to develop an equation to convert the four more specific dimensions described above into an overall course quality score[4]. Although this equation-generated overall score offers equivalent information to the two overall items currently being used, the results suggested that by using these 13 more specific (and less biased) questions to assess course quality and then combining them mathematically (rather than asking students for global ratings), the resulting information would be far less biased by factors such as instructor attractiveness, students' prior interest, and expected grade. Thus, the results indicated that by switching to the 13 items above, instructors would not only get more detailed information on students' experiences in their courses (by viewing their scores on the 4 dimensions and on the individual 13 items) but could also be given an overall quality rating that is far less biased by external factors.

[4] Specifically, we developed a regression equation using scores on the 4 new composites to predict scores on a composite of global items. As a result, the mathematically derived global scores are on the same 5-point scale and correlate very strongly with global ratings, suggesting that they are providing the same information. However, as those global scores are generated mathematically from 4 scales with lower levels of bias, those global scores provide that information without the excessive bias seen in all of the global items.

To help illustrate the additional information that could be obtained through the adoption of the proposed 13 items, we have created graphs illustrating the current feedback (focusing on the two items included in faculty activity reports) as well as the feedback that could have been generated from the 13 items for 4 simulated courses (based on patterns we observed in the actual course data). As seen in panel A below, the instructor teaching that simulated course would have seen an average course overall rating of 2.6 and an instructor overall rating of 2.4, suggesting lower levels of student satisfaction but failing to give specific feedback on what could be improved in the future. However, had the 13 items been part of the standard course evaluation at the University of Rochester, that instructor would have gotten a mathematically computed overall rating of 2.95 (somewhat higher once based on the 13 questions with lower levels of bias). More importantly, the instructor would have been able to see that the students believed that he or she had done a fair job of building rapport with students and remaining organized throughout the semester but found his or her teaching skills to be less satisfactory. The instructor could then have examined the average scores on the 4 items making up the teaching skills composite to get even more detailed feedback on areas that he or she could improve.

Turning to panel B, the instructor teaching that course would have seen an average course overall rating of 2.8 and an instructor overall rating of 2.7 with little additional information to help guide the future improvement of that course. However, had the course evaluation been based on the proposed 13 items, the instructor would have been able to see that the students found him or her to be very organized and found the exams and assignments to be helpful but found the instructor less able to connect with the students in the course and somewhat lacking in the specific teaching skills assessed. Once again, this instructor could then have examined the average scores on the items making up those two composites to get even more detailed student feedback on areas that he or she could improve in the future.
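As a concrete illustration of the scoring scheme, the sketch below turns one student's hypothetical responses to the 13 items into the four dimension composites (reverse-scoring the negatively worded Organization items) and then into an equation-generated overall score. The responses, regression weights, and intercept are all invented for illustration; the project's actual equation was fit to the study data and is not reproduced here.

```python
import numpy as np

# hypothetical 1-5 responses from one student to the 13 proposed items
teaching = [4, 3, 4, 3]
exams = [4, 4, 5]
rapport = [5, 4, 4]
organization_raw = [2, 1, 2]                      # negatively worded items
organization = [6 - s for s in organization_raw]  # reverse-score: 2 -> 4, 1 -> 5

# dimension composites = simple item averages
dims = np.array([np.mean(d) for d in (teaching, exams, rapport, organization)])

# equation-generated overall score: intercept + weighted sum of the four
# composites; these weights are placeholders, not the project's fitted values
intercept = 0.2
weights = np.array([0.35, 0.2, 0.2, 0.2])
overall = intercept + weights @ dims
print(round(float(overall), 2))
```

The key point the sketch captures is that the overall score is computed mathematically from the less biased dimension composites rather than asked of students directly as a global rating.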

Similarly, although the 2 current items would have provided positive feedback to the instructor for panel C, the proposed 13 items could have informed that instructor that his or her organization was a particular strength whereas his or her exams and assignments were less satisfactory in the eyes of students. Finally, the feedback for the two current items presented for the simulated course in panel D would not only have failed to provide more detailed feedback to the instructor, but those average ratings would have had high levels of error when based on so few responses (based on the confidence intervals suggesting that the means were estimated with an error of +/- 0.8 points in this simulated course). Had the feedback been based on the proposed 13 items, the instructor would have gotten a more accurate estimate of overall quality (with errors of only +/- 0.4 points) and would have seen that the students found him or her to be somewhat disorganized. Thus, by adopting the proposed 13 items and giving instructors feedback on 1) the new global composite, 2) the average ratings on the 4 new dimensions, and 3) the average ratings on the individual items, the University of Rochester would be providing instructors valuable information to hone the quality of their courses and would be providing that information with greater levels of accuracy (lower error) and lower levels of bias from factors like previous interest in the course, expected grade, and instructor attractiveness.

Goal 4: Contrasting Online to In-Class Administration. Using the same modeling approach used in Goals 2 and 3, we built models allowing the sources of bias assessed in the project to predict responses on each of the current evaluation questions as well as on the 13 proposed questions of the new scale. In these models, we introduced terms to determine: 1) whether average ratings on each item were higher or lower with online administration, and 2) whether any of the sources of bias had stronger or weaker effects with online administration. This multi-level multivariate approach offered a powerful method of detecting possible degradation of information from online administration. However, the analyses failed to identify any average differences between evaluations collected online vs. in-class on any of the items or composites examined, suggesting that average course ratings might in fact be comparable across the two methods of administration. Furthermore, when method of assessment (online vs. in-class) was introduced into the models, it only accounted for a small amount (0-4%) of the variability in course ratings. This suggests that 96-100% of the differences in course ratings observed between courses were completely unrelated to how those ratings were obtained. Given the diverse array of courses involved in the project, the rigorous design of the project, and the large number of student responses supporting these analyses, the lack of significant findings for online administration biases is actually quite striking. Taken as a set, these results suggested that collecting course evaluation data online does not seem to adversely impact the quality of information obtained, despite the markedly lower response rates associated with that method.

Goal 5: Determining Number of Responses Needed.
Finally, given the lwer respnse rates assciated with nline administratin f curse evaluatins, the prject directly examined the levels f uncertainty that arise with exceedingly small numbers f respndents (i.e., as few as 5 student respnses per class). Using the 88 respnses frm ne f the larger curses in the study as a basis, we calculated the 95% cnfidence intervals fr class averages based n subsamples f 5, 10, 15, 20, 25, 30, 40 and 50 students. Cnfidence intervals prvide upper and lwer bundaries fr a sample average (e.g., average rating n a curse evaluatin questin), typically presented as plus r minus a certain amunt f pints n the scale t indicate the errr assciated with that estimate. As seen in the figure belw, the tw verall quality items demnstrated relatively high levels f errr when estimated in very small (e.g., 5-10) samples f students. In fact, had the average fr the item, What verall rating wuld yu give this curse? been based n nly 5 student respnses, that mean wuld nly have been accurate t nly +/- 0.9 pints. Thus, had the instructr fr this curse gtten an average f 3.5 n that item frm nly 5 respnses, the true mean fr his curse culd have been as high as 4.4 r as lw as 2.6. That reflects a high level f uncertainty, and suggests that class averages based n exceedingly small subsamples f respnses (e.g., as few as 5 r 10 respnses frm curses with far greater numbers f students) shuld be interpreted very cautiusly. The graph als

indicates that as response rates increase to samples of 40-50 students, these error rates drop rapidly. Thus, for larger courses (e.g., courses with 80 or more students), the results suggest that even responses from as few as 40-50 students might still provide reasonably accurate estimates of course quality [5]. Turning to the results for specific course evaluation items, the results indicated that it would require a minimum of 40-50 responses to obtain averages on the two overall items currently in use at the University of Rochester that were accurate to +/- 0.3 points on the 5-point scale. In contrast, the global composite based on responses to the 13 proposed questions would require only 20 responses to obtain a similar level of accuracy. Although this is not surprising, as multi-item scales tend to give more reliable and accurate information, the graph demonstrates just how much accuracy is gained by switching to a mathematical composite in lieu of the two global items: the composite literally offers comparable levels of accuracy with half as many responses (20 vs. 40 responses to obtain an accuracy of +/- 0.3 points). Thus, the results of the project not only suggest that the new 13-item scale would provide more diverse and less biased information, but they also suggest that the global composite based on those 13 items would offer higher levels of accuracy, particularly in courses with low participation rates. It is worth noting that for large courses (e.g., over 100 students) these estimates of error are relatively independent of course size. Thus, even in a course of 200 students, obtaining 50 responses would give relatively low error rates, particularly on the proposed global composite based on the 13 items. This provides additional evidence to suggest that the lower response rates observed from online administration might not be problematic, particularly for larger courses where even a 30% response rate would yield 50 or more student responses.

[5] This argument primarily applies to larger courses. If a course has only 20 students and 18 of them provide responses on the course evaluation, then despite the uncertainty arising from means based on only 18 responses, the resulting means could still be considered a fair representation of course quality, as almost all of the students would have participated. Similarly, 7 responses in a course of 8 students would also give a fair representation of course quality. However, in moderate to large courses (e.g., with 40 or more students enrolled), means based on only 18 or 7 responses would be far more suspect and the uncertainty would be of greater concern.

Project Recommendations

1) Discontinuing use of excessively biased items. The results strongly suggested that the two questions currently representing quality of instruction (overall ratings of the course and instructor), as well as a larger composite of the 5 course quality questions currently in use, were strongly influenced by sources of bias. Based on these results, we would suggest that the university consider discontinuing use of such global items. As students' ratings of their own effort in the course also demonstrated excessively high levels of influence by bias, we would suggest those be dropped from course evaluations as well.

2) Shifting to more specific domains. The results identified 4 distinct dimensions of course quality (teaching skills, quality of exams/assignments, rapport with students, organization) assessed at other universities. We would recommend that the University of Rochester consider shifting the focus of our course evaluations to these dimensions (instead of simply using global evaluative questions), as it would provide instructors more detailed and useful feedback on their courses with lower levels of bias.

3) Using multiple items to assess each domain. The results suggested that the use of individual items to assess quality (as is the current practice) leads to problematic levels of error when small numbers of student responses are obtained for a specific course. Specifically, when a course average is based on a subset of only 10 students providing course evaluations, the averages on either of the overall items have error rates of roughly +/- 0.6 points, suggesting high levels of inaccuracy.
As online administration has led to markedly lower response rates, this is concerning, because classes with only 25 students could easily have 10 or fewer students provide course evaluation data. However, the data also indicated that by using 3-4 items to assess a domain (rather than 1) and by creating a global quality composite based on all 13 items, it was possible to obtain estimates with higher accuracy (lower error), even in smaller samples. Consequently, we identified 13 items (with notably lower levels of influence by bias) to assess the four proposed dimensions. We would recommend the University of Rochester consider adopting the use of these items for assessing course quality in lieu of the global items currently in use.

4) Using an equation to generate global-quality scores. As the results suggested that asking students globally evaluative questions on course quality introduced high levels of bias, we would recommend

using an equation to synthesize the 4 proposed dimensions into an overall score. This equation was developed in the current dataset to most closely represent the information obtained by global questions (e.g., "What is your overall rating of this course?") but offers that information with markedly lower levels of bias and higher levels of accuracy. In fact, the results suggest that it provides an estimate of overall quality that is relatively stable, even when based on as few as 15-20 responses.

5) Discontinuing use of largely unused questions. The current course evaluation survey offers an open-ended question asking for comments after each and every numeric question. This creates an additional 16 items on the survey. However, fewer than 2% of students use the majority of those open-ended questions. Most students constrain their open-ended comments to the strengths and weaknesses questions regarding the instructor and the strengths and weaknesses questions regarding the course. In the interest of parsimony, we would recommend simply retaining those open-ended questions that the students actually use.

6) Limiting interpretation of means from small samples. Given the higher levels of inaccuracy resulting from means calculated in samples of fewer than 20 respondents, we would recommend high levels of caution when interpreting course data from smaller numbers of responses, particularly when those represent only a fraction of the students in the course.

7) Presenting confidence intervals with averages. To more directly address the issue of error, all instructors could be given 95% confidence intervals for each of the averages presented to them in their student evaluation feedback. The equations for this are very straightforward and could easily be programmed into the standard online feedback. Providing this information would enable instructors to determine for themselves the level of healthy skepticism appropriate for any set of means, particularly for means generated from smaller numbers of student responses.
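As recommendation 7 notes, the confidence-interval computation is straightforward to program. As a minimal sketch (using the standard normal-approximation 95% interval, mean +/- 1.96 * SD / sqrt(n); the project's exact formulas are not reproduced in this report), the margin of error for a class average could be generated as follows. The function names and the illustrative rating SD of 1.0 are assumptions for this sketch, not values taken from the project's data:

```python
import math
import statistics

def mean_ci95(ratings):
    """95% confidence interval for a class average, using the usual
    normal approximation: mean +/- 1.96 * SD / sqrt(n)."""
    n = len(ratings)
    m = statistics.fmean(ratings)
    half = 1.96 * statistics.stdev(ratings) / math.sqrt(n)
    return m, m - half, m + half

def half_width(sd, n):
    """Margin of error for a mean of n ratings with standard deviation sd."""
    return 1.96 * sd / math.sqrt(n)

# With a rating SD near 1.0 (an assumed value), 5 responses give a margin
# close to the +/- 0.9 points discussed under Goal 5, and the margin
# shrinks steadily as more responses accumulate:
# half_width(1.0, 5)  -> about 0.88
# half_width(1.0, 50) -> about 0.28
```

Intervals like these could be printed next to each average in the online feedback, so an instructor seeing a mean of 3.5 from 5 responses would also see that the plausible range runs roughly from 2.6 to 4.4.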

CURRENT COURSE EVALUATION QUESTIONS

Status of this course in your program? [Major, Elective, Other Requirement, Uncertain]
Class Year? [Freshman, Sophomore, Junior, Senior, Graduate, Non-Matriculated, Other]

On the Student

- Rate the level of your involvement in the activities of this course (for example: attendance, participation, completing assignments). [Fully Engaged, Mostly Engaged, Average, Partially Engaged, Minimally Engaged] Comments:
- Rate the increase of your knowledge or skills from this course. [Greatly Increased, Increased, Average, Minimally Increased, Not Increased] Comments:
- What overall rating would you give yourself as a student in this course? [Very Hardworking & Professional, OK, Very Lax & Unprofessional] Comments:

On the Course

- How well did the syllabus describe the course content? [Extremely Well, Very Well, Well, Not Well, N/A] Comments:

- The readings were important in my learning in this course. [Very Important, Somewhat Important, Not Important, N/A] Comments:
- How well did the course assignments and exams support course objectives? [Extremely Well, Very Well, Well, Not Very Well, Poorly] Comments:
- What are the major strengths of this course? STRENGTHS:
- What are the major weaknesses of this course? Please make suggestions for improvement. POSSIBLE IMPROVEMENTS:
- What overall rating would you give this course? [Excellent, Very Good, Average, Not Very Good, Very Poor] OVERALL Comments on COURSE:

On the Instructor

- How responsive was the instructor in and out of class? [Very Responsive, Mostly Responsive, Average, Minimally Responsive, Unresponsive] Comments:

- How effective was the instructor's teaching in this course? [Very Effective, Effective, Average, Minimally Effective, Ineffective] Comments:
- I have a stronger interest in this subject because of the instructor. [Strongly Agree, Agree, Neutral/Mixed, Disagree, Strongly Disagree] Comments:
- What are the major strengths of this instructor? STRENGTHS:
- What are the major weaknesses of this instructor? Please make suggestions for improvement. POSSIBLE IMPROVEMENTS:
- What overall rating would you give this instructor? [Excellent, Very Good, Average, Not Very Good, Very Poor] OVERALL comments on INSTRUCTOR:
- Additional Comments on INSTRUCTOR: If there are any other further comments you would like to make about this course, please do so in the space provided below.

PROPOSED COURSE EVALUATION QUESTIONS

Response options for all rated items: Not at all, A little, Somewhat, Quite a bit, Very

In this class...
- Instructor used examples in lecture / class discussions that really helped me understand the material
- The instructor's way of summarizing or emphasizing important points was effective
- As the course progressed, the instructor showed me how each of the topics fit into a whole
- Instructor noticed when students did not understand and adjusted the lecture pace accordingly
- The exam questions were clearly worded
- The exams covered important aspects of the course
- The assignments were helpful in understanding the material
- The instructor demonstrated sincere respect for students
- The instructor was willing to listen to student questions and/or opinions
- The instructor made him- or herself available for extra help
- Lectures / class discussions were disorganized
- Instructor frequently got off topic during lectures
- The instructor tended to be ill prepared

Open-ended items:
- What are the major strengths of this instructor? Please share any suggestions for improvement. INSTRUCTOR STRENGTHS: / POSSIBLE AREAS FOR IMPROVEMENT:
- What are the major strengths of this course? Please share any suggestions for improvement. COURSE STRENGTHS: / POSSIBLE AREAS FOR IMPROVEMENT:
- ADDITIONAL COMMENTS:
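The report describes combining the 13 rated items above into four dimension scores and a global composite using an equation fit to the project's data; that equation is not reproduced here. The sketch below is a hypothetical equal-weight version that only illustrates the mechanics: the item-to-dimension grouping follows the four dimensions named in the report, the last three (negatively worded) items are reverse-scored, and the 0-4 numeric coding of the response options is an assumption:

```python
# Hypothetical scoring sketch for the 13 proposed items. Assumes a 0-4
# numeric coding ("Not at all" = 0 ... "Very" = 4); the report's actual
# scoring equation was derived from data and is not reproduced here.

DIMENSIONS = {
    "teaching_skills":   [0, 1, 2, 3],   # examples, summaries, integration, pacing
    "exams_assignments": [4, 5, 6],      # exam wording, exam coverage, assignments
    "rapport":           [7, 8, 9],      # respect, listening, availability
    "organization":      [10, 11, 12],   # negatively worded items
}
REVERSED = {10, 11, 12}   # disorganized / off topic / ill prepared
SCALE_MAX = 4

def score(responses):
    """responses: 13 ratings on the assumed 0-4 scale.
    Returns dimension means plus an equal-weight overall composite."""
    adjusted = [SCALE_MAX - r if i in REVERSED else r
                for i, r in enumerate(responses)]
    dims = {name: sum(adjusted[i] for i in idx) / len(idx)
            for name, idx in DIMENSIONS.items()}
    dims["overall"] = sum(dims.values()) / len(dims)
    return dims
```

In practice, the regression-derived weights from the project's dataset would replace the equal weights used here for the overall composite.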