
ANNALS OF EMERGENCY MEDICINE JOURNAL CLUB

A Consideration of the Measurement and Reporting of Interrater Reliability: Answers to the July 2009 Journal Club Questions

Frank C. Day, MD, MPH; David L. Schriger, MD, MPH
From the University of California, Los Angeles, Los Angeles, CA. Copyright 2009 by the American College of Emergency Physicians. doi: /j.annemergmed

Editor's Note: This 10th installment of Annals of Emergency Medicine Journal Club departs slightly from previous installments by focusing on a single methodological issue, the measurement of reliability. We use the Cruz et al article as a jumping-off point for our discussion. 1 Although this installment may be appropriate for some residency journal clubs (particularly if they use our more basic questions and add some clinical questions about the article), we suspect that it will be of greater value to research fellows and researchers. Readers should recognize that these are suggested answers and, although it is hoped that they are correct, are by no means comprehensive. There are many other points that could be made about these questions or about the article in general. Questions are rated novice, intermediate, and advanced.

DISCUSSION POINTS

1. Cruz et al 1 contains 2 parts, a comparison of the values gathered by trained research assistants and physicians about historical information in chest pain patients and the comparison of these participants' recordings with a correct value for each item.

A. For each part, indicate whether the authors are studying reliability or validity and explain the difference between these concepts.

B. What did the authors use as their criterion standard for the validity analysis?

C. What are potential problems with their method of defining the criterion (gold) standard? Can you think of alternative approaches?

D. The authors report crude agreement and interquartile range for their validity analysis.
What part of a distribution is described by the interquartile range? List other statistics used to describe the validity of a measure and why they might be preferable to reporting crude agreement.

2. Crude percentage agreement is a simple way to report reliability. Consider the contingency table for the question, "Was the quality of the chest pain crushing?" (yes or no):

                  MD Recorded Yes   MD Recorded No   Total
RA recorded yes   117               6                123
RA recorded no    18                2                20
Total             135               8                143

MD, medical doctor; RA, research assistant.

A. Calculate the crude percentage agreement for this table. What is the range of possible values for percentage agreement?

B. Calculate Cohen's κ for this table. What is the formula for κ for raters making a binary assessment (eg, yes/no or true/false)? Discuss the purpose of Cohen's κ, its range, and the interpretations of key values such as -1, 0, and 1.

C. What other measures can be used to measure reliability for binary, categorical, and continuous data?

3. Cruz et al quote the oft-cited Landis and Koch 2 article stating that a κ of less than 0.2 represents poor agreement; 0.21 to 0.40, fair agreement; 0.41 to 0.60, moderate agreement; 0.61 to 0.80, good agreement; and 0.81 to 1.00, excellent agreement. Consider studies of the agreement of airline pilots deciding whether it is safe to land and psychologists deciding whether interviewees have type A or type B personalities. If the studies produced the same numeric κ value, would the adjectives assigned by Landis and Koch be equally appropriate?

4. A. Imagine 2 blindfolded, intelligent individuals who are sitting in distant corners of a room and listening to 100 easy true/false statements such as "red is a color," "2 + 2 = 5," etc, over a loudspeaker. Each indicates his or her choice by pressing a button in the left hand for false and in the right for true. Questions are not repeated and the respondents are expected to offer a response for each statement.
Verify that if they agree on all 100 answers, percentage agreement is 100% and κ is 1.0, regardless of how many statements are true and how many are false. Now imagine that the testing site is under the final approach for a major airport, and, at times, noise from jets flying overhead drowns out the statements from the

Volume 54, NO. 6 : December 2009, Annals of Emergency Medicine, 843

loudspeaker. When this occurs, respondents agree, on average, only half the time (as one would expect). Recalculate percentage agreement and κ for the same 100-statement test conducted under the following sets of conditions: (1) half the statements are true and 1% of the statements are rendered incomprehensible by the planes; (2) 90% of the statements are true and 1% of the statements are rendered incomprehensible by the planes; (3) half the statements are true and 20% of the statements are rendered incomprehensible by the planes; and (4) 90% of the statements are true and 20% of the statements are rendered incomprehensible by the planes. Discuss the meaning of percentage agreement and κ in these 4 settings.

B. Consider the 2 tables below and calculate percentage agreement and κ for each. Why is κ lower on the right? What does this mean? Imagine that the right-hand table was from the true/false experiment described above and that planes were flying so frequently that every question was somewhat difficult to hear. Imagine 2 scenarios: in the first, both raters are told that there are 80 true statements and 20 false statements. In the second, raters are told that there could be 100 true statements with no false statements, 100 false statements with no true statements, or any combination in between, with each having an equal probability of occurring. Does κ mean the same thing in these 2 situations?

C. To further consider the meaning of κ, imagine that planes flew overhead such that 60 statements were heard perfectly and 40 were barely comprehensible or not heard at all. Below are separate tables for the 60 audible and 40 incomprehensible statements.

5. Finally, the following graph shows percentage agreement versus κ for the first 50 items in Table 1 of Cruz et al. The points are shaded to indicate how many subjects fall into the smallest cell in the 2 × 2 table.

A. Four lines in the table are denoted with square markers (near the arrow) on the graph (Is pain burning?
Does it radiate to the back? Does it radiate to the jaw? Does it radiate to the left arm?). Create (approximate) 2 × 2 tables for these 4 points. Can you explain why these tables have similar percentage agreement but varying κs? Which do you believe is the better measure? Why do the κs differ?

Figure 1. κ vs percentage agreement, stratified on size of smallest cell; 54 historical items from Table 1, Cruz et al (n = 143). The x axis shows percentage agreement (0% to 100%) and the y axis shows κ.

B. Can you comment on the relationship between the size of the smallest cell in the 2 × 2 table and the extent to which κ may deviate from percentage agreement?

C. Given the problems with both percentage agreement and κ illustrated in these examples, do you think it would be better if investigators presented the 4 numbers in the inner cells of each 2 × 2 table instead of reporting the percentage agreement or κ? Calculate percentage agreement and κ for these tables. Which is the better measure for each? Consider the confidence level of the raters in the different scenarios presented in this exercise. Should rater confidence be considered when interrater reliability is described? How might this be done?

ANSWER 1

Q1. Cruz et al contains 2 parts, a comparison of the values gathered by trained research assistants and physicians regarding historical information in chest pain patients, and the comparison of these participants' recordings with a correct value for each item.

Q1.a For each part, indicate whether the authors are studying reliability or validity and explain the difference between these concepts.

The first part is an assessment of reliability, and the second is an assessment of validity. The distinction between reliability and validity is an important one. At the racetrack, handicappers may unanimously agree (100% interrater

reliability) that Galloping George will win the third race. When he comes in dead last, however, track aficionados receive a painful reminder that even a perfectly reliable analysis does not guarantee a valid result. The reliability of a test speaks only to the agreement obtained when multiple fallible observers independently conduct the test on the same persons, specimens, or images. In contrast, an assessment of validity compares a fallible observer against a criterion, or gold, standard. Because the criterion standard is assumed to be correct, validity studies typically report the comparative performance of the fallible observer using statistics such as sensitivity and specificity, or likelihood ratios, not reliability metrics such as percentage agreement or κ.

Q1.b What did the authors use as their criterion standard for the validity analysis?

When the physician and the research assistant agree, it is assumed that their answer is correct. When they disagree, a different research assistant has the patient select which of the 2 discrepant answers is correct.

Q1.c What are potential problems with their method of defining the gold (criterion) standard? Can you think of any alternative approaches?

Defining a gold (criterion) standard for this study is not trivial. For any item, there can be 2 truths: what the patient answers and what is actually true. For example, a patient asked "Do you have pain in the epigastric region?" might say yes, thinking that epigastric is a fancy word for butt cheeks, when in fact the true answer is no. Or a patient might say that his cholesterol is normal despite its being 300 because he does not understand the laboratory results his physician shared with him. What then is the criterion standard for this study: what the patient said, what the patient should have said, or what the patient would say if the information were optimally elicited? The answer, of course, is that we have no way to know what the true answer is.
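Since validity comparisons against a criterion standard are summarized with sensitivity, specificity, and likelihood ratios rather than agreement statistics, a minimal sketch of those calculations may be helpful. The counts below are hypothetical, not data from Cruz et al:

```python
def validity_stats(tp, fp, fn, tn):
    """Sensitivity, specificity, and likelihood ratios from a 2x2
    observer-vs-criterion-standard table (hypothetical counts)."""
    sens = tp / (tp + fn)          # proportion of true positives detected
    spec = tn / (tn + fp)          # proportion of true negatives detected
    lr_pos = sens / (1 - spec)     # likelihood ratio for a positive finding
    lr_neg = (1 - sens) / spec     # likelihood ratio for a negative finding
    return sens, spec, lr_pos, lr_neg

sens, spec, lr_pos, lr_neg = validity_stats(tp=80, fp=10, fn=20, tn=90)
print(sens, spec, round(lr_pos, 1), round(lr_neg, 2))  # 0.8 0.9 8.0 0.22
```

The same 4 inner cells drive both kinds of analysis; what changes is whether one axis of the table is a fallible rater or an assumed-correct criterion standard.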
Most emergency medicine residents have had the experience of reporting some part of a patient's history to their attending physician and soon thereafter hearing the patient give the attending physician a completely different history! This answer drift could be because one or the other of the physicians asked the question in a manner that was clearer to the patient, or because extra time or reflection resulted in the patient expressing a different answer. We do not know if the answers given on the second or third interview are more truthful, or whether they are provided to appease the interviewers and end the questioning. Some patients may be too confused, in too much pain, or too distracted to give an accurate reply. A better approach, which the authors acknowledge, would have been to randomize the order in which the research assistant and physician interviewed each patient. That should equally distribute and minimize the effect of any bias related to a change in answer accuracy with repeated questioning.

Q1.d The authors report crude agreement and interquartile range for their validity analysis. What part of a distribution is described by the interquartile range? List other statistics used to describe the validity of a measure and why they might be preferable to reporting crude agreement.

The interquartile range (IQR) refers to the middle 50% of a distribution (Figure 2; adapted from Wikipedia: /Interquartile_range).
The original definition of IQR is a single number that represents the distance from the 25th percentile to the 75th percentile, though this format is seldom used. Instead, investigators typically present the 25th and 75th percentiles (in the format [25th percentile, 75th percentile]), from which the real IQR can easily be gleaned by subtraction. In statistics, the term quartiles refers to the 3 points that divide a distribution into 4 equal parts. In epidemiology, the term is typically used to signify these 4 equal parts. The second quartile is called the median, the first the 25th percentile, and the third the 75th percentile. The IQR is the difference between the third and first quartiles. This is a more robust (less influenced by outlier observations) descriptive statistic than the range of a distribution, and it is more relevant when data are not in the shape of a classic bell curve (ie, not normally distributed). When data are skewed and thus not normally distributed (Figure 2), the mean, median, and IQR convey the center of the observed values. In this case, the mean ± 2 SDs (central tendency

statistics for a normal bell curve distribution) goes from roughly -0.8 to 2.8. Note that the left-sided value of -0.8 is well outside the range of the data. Consequently, if the only thing readers were told about this distribution was that the mean is 1 and the mean ± 2 SDs is -0.8 to 2.8, they would likely imagine a curve very different from Figure 2 and would likely assume that values between 0 and -0.8 existed.

For those questions in which the research assistant and physician did not agree, the authors report (by category) the percentage agreement with the correct answers (as determined by the tiebreaker criterion standard). Percentage agreement is a reasonable statistic for a reliability assessment, but is not the appropriate statistic to best describe this validity assessment (comparison of a fallible observer with a criterion standard). Studies that are designed to estimate a test's validity should report statistics such as sensitivity and specificity, or likelihood ratios, not reliability metrics such as percentage agreement or κ.

ANSWER 2

Q2. Crude percentage agreement is a simple way to report reliability. Consider the contingency table for the question, "Was the quality of the chest pain crushing?" (yes or no):

                  MD Recorded Yes   MD Recorded No   Total
RA recorded yes   117 [a]           6 [b]            123
RA recorded no    18 [c]            2 [d]            20
Total             135               8                143

Q2.a Calculate the crude percentage agreement for this table. What is the range of possible values for percentage agreement?

The 2 observers both agreed "yes" 117 times and "no" 2 times. Thus, crude percentage agreement for this table is (117 + 2)/143 = 83.2%. Percentage agreement can range between 0% and 100%.

Q2.b Calculate Cohen's κ for this table. What is the formula for κ for raters making a binary assessment (eg, yes/no or true/false)? Discuss the purpose of Cohen's κ, its range, and the interpretations of key values such as -1, 0, and 1.
The κ statistic, introduced by Cohen 3 in 1960, is defined as:

κ = (% agreement observed - % agreement expected due to chance) / (1 - % agreement expected due to chance)

κ is easily calculated with statistical software, but we will discuss the manual method as well. For 2 × 2 contingency tables, it is customary to refer to the inner cells by letters, with [a] and [b] on the top row and [c] and [d] just below. The outer 5 cells represent various row and column totals of the 4 inner cells a to d. Because these 5 cells occupy the margin of the table, they are often referred to as marginal totals. Note that if one knows the values of inner cells a to d, then one can calculate all 5 marginal totals. The reverse is not true; in most circumstances one cannot determine the inner cells from the marginal totals. The agreement cells for this table (where the 2 raters both recorded the same thing, either yes or no) are [a] and [d]. κ uses the marginal totals to calculate the percentage of expected agreement due to chance for each agreement cell, and these are summed to determine the expected percentage agreement. Using the values from Table 1, κ is calculated as follows:

(i) The expected value of cell a due to chance alone is: a_expected = (a + b)(a + c)/(a + b + c + d) = (123 × 135)/143 = 116.1

(ii) The expected value of cell d due to chance alone is: d_expected = (c + d)(b + d)/(a + b + c + d) = (20 × 8)/143 = 1.1

(iii) The percentage agreement due to chance alone is: (a_expected + d_expected)/(a + b + c + d) = (116.1 + 1.1)/143 = 0.820

At the beginning of this section, we calculated that observed agreement was 83.2%, or 0.832. Now, we have determined that the expected agreement due to chance alone is 0.820. Plugging these 2 numbers into the formula yields κ = (0.832 - 0.820)/(1 - 0.820) = 0.07. κ was introduced as a coefficient of agreement for nominal scales, 3 intended to measure agreement beyond chance.
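The manual calculation above is easy to script. A minimal sketch for a 2 × 2 table, using the cell values from the crushing-chest-pain table:

```python
def percent_agreement_and_kappa(a, b, c, d):
    """Percentage agreement and Cohen's kappa for a 2x2 table.

    a and d are the agreement cells; b and c the disagreement cells.
    """
    n = a + b + c + d
    p_obs = (a + d) / n
    # Expected chance agreement, computed from the marginal totals
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / (n * n)
    kappa = (p_obs - p_exp) / (1 - p_exp)
    return p_obs, kappa

p_obs, kappa = percent_agreement_and_kappa(117, 6, 18, 2)
print(round(p_obs, 3), round(kappa, 2))  # 0.832 0.07
```

Note how little room the skewed marginals (123/20 and 135/8) leave between observed agreement (0.832) and chance agreement (0.820), which is why κ collapses to 0.07 despite 83% agreement.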
κ can range from -1 (with negative numbers indicating that observed agreement occurs less often than expected by chance) to 1 (perfect agreement, when observed percentage agreement is 1 regardless of the percentage agreement expected due to chance). A κ of zero signifies that observed agreement is exactly that expected by chance alone (percentage observed agreement = percentage expected agreement). An inherent assumption of the κ statistic is that the marginal totals of the observed agreement table adequately define chance agreement. This assumption, like many of the assumptions in classic statistics, implies that all observations are independent, identically distributed, and drawn from the same probability density function. Under these very limited and strict conditions, chance agreement will be a function of the observed marginals. We explain these assumptions in layman's terms in subsequent questions.

Q2.c What other measures can be used to measure reliability for binary, categorical, and continuous data?

Reliability can be measured with a multitude of methods. An excellent review 4 emphasizes that there is little consensus about which is best and that no one method is appropriate for all occasions. It is important to consider what kinds of data are being compared. Categorical (also called discrete) variables take on a small, finite number of values. These qualitative variables include nominal (no meaningful order, such as disposition admitted, transferred, or home) and ordinal (ordered in a meaningful sequence, such as Glasgow Coma Scale score 3 to 15). A binary variable is a categorical variable with only 2 options (female or male; yes or no). Continuous variables (such as pulse rate) can theoretically take on an infinite number of values (a patient's pulse could be precisely beats/min), but both clinical relevance and measurement accuracy effectively categorize most continuous variables (pulse rate is estimated to the nearest integer). Many reliability measurements are intended for use only with continuous variables, and one must decide whether a variable is continuous enough to permit their use.

Several measures of correlation are available for use as reliability metrics, but there are important limitations with using correlation to measure agreement. First, correlation is frequently used colloquially to indicate any association between 2 variables, but in statistics, correlation implies only a linear association. Consider the scatterplots in Figure 3 (which graph observed values for 2 variables) and their associated correlation coefficients (in these examples, the Pearson product-moment coefficient, which ranges between -1 and 1).

One such measure, Kendall's S, is computed as S = C (the number of agreement pairs) - D (the number of disagreement pairs). A preponderance of agreement pairs (resulting in a large positive value of S) indicates a strong correlation between 2 variables; a preponderance of disagreement pairs (resulting in a large negative value of S) indicates weak correlation.
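The S statistic and its standardized form (Kendall's τ, discussed next) can be sketched directly from the definition of concordant and discordant pairs. The two small rankings below are hypothetical:

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau via the S statistic: S = C - D, tau = 2S / (n(n - 1))."""
    assert len(x) == len(y)
    n = len(x)
    s = 0
    for i, j in combinations(range(n), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        if prod > 0:        # concordant: both rankings order i, j the same way
            s += 1
        elif prod < 0:      # discordant: the rankings disagree on the order
            s -= 1
    return 2 * s / (n * (n - 1))

# Two raters ranking 5 subjects (hypothetical data)
print(kendall_tau([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))  # 1.0
print(kendall_tau([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]))  # -1.0
```

The 2S/(n(n - 1)) standardization simply divides S by the total number of pairs, which is what pins τ between -1 and 1 regardless of sample size.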
A disadvantage of S is that its range depends on the sample size, but a simple standardization (computed as 2S/(n(n - 1))) gets around this problem, and Kendall's τ always ranges between -1 and 1. Spearman's ρ involves a more complicated, less-intuitive calculation 8 and is equivalent to Kendall's τ in terms of ability to measure correlation.

Figure 3. Adapted from File:Correlation_examples.png. Because correlation coefficients measure only how well observed data fit with a straight line, a correlation coefficient of zero may indicate that the 2 variables are not associated with each other (or are independent, as in the middle of the top row) or may be missing a more complex but potentially meaningful nonlinear association (as in the bottom row).

Another limitation with using correlation is that 2 judges' scores could be highly correlated but show little agreement, as in the following example 5 (table: Subject, Rater A, Rater B). The Kendall 6 and Spearman 7 coefficients measure the degree of correlation between 2 rankings. These coefficients require ordinal and not simply nominal data.

The intraclass correlation (ICC) can also be used to measure reliability. 9 The ICC compares the variance among multiple raters (within a subject) to the overall variance (across all ratings and all subjects). Imagine that 4 physicians (raters) use a

decision aid to independently estimate the likelihood of acute coronary syndrome in each of 20 patients (subjects). The 2 graphs (Figure 4) show 4 estimates (1 dot for each physician) for each patient. In the upper graph, the 4 raters give similar ratings for each patient. The variation in ratings for any given patient is small compared with the total variance of all the ratings. Said another way, there is more variance in the ratings among patients than there is in the ratings within patients. A high ICC suggests that the raters have good correlation (when one rater scores a subject high, so do the others). Ratings for each patient tend to be clustered. In the bottom graph, ratings within each subject are all over the place. Here the ICC would be lower, as the raters are not highly correlated. A number of ICC estimators have been proposed within the framework of ANOVA. Unfortunately, the various ICC statistics can produce markedly different results when applied to the same data. We believe that the pictures tell the most complete story about agreement and are free from the assumptions made by the various statistics.

The Bland-Altman approach is a graphic presentation of agreement data that plots the difference in measurements for each subject pair against their mean. 10 Consider a study that measures peak expiratory flow rate using 2 different meters in 17 patients (Figure 5). The top scatterplot suggests that the results from each of these 2 meters are similar. However, the bottom graph (a Bland-Altman plot) examines this association in more detail by plotting the differences between the paired measures for each patient (y axis) stratified by the mean of each pair. In this case, this plot confirms that the average difference between the 2 meters is very close to zero.
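The quantities behind a Bland-Altman plot are simple to compute. A minimal sketch using made-up readings from 2 hypothetical meters (not the 17-patient peak-flow data described here):

```python
# Bland-Altman quantities for paired measurements (synthetic L/min readings)
meter1 = [490, 310, 620, 410, 530]
meter2 = [510, 300, 650, 390, 545]

diffs = [m1 - m2 for m1, m2 in zip(meter1, meter2)]
means = [(m1 + m2) / 2 for m1, m2 in zip(meter1, meter2)]

bias = sum(diffs) / len(diffs)  # average difference between the meters
sd = (sum((d - bias) ** 2 for d in diffs) / (len(diffs) - 1)) ** 0.5
limits = (bias - 1.96 * sd, bias + 1.96 * sd)  # 95% limits of agreement

print(round(bias, 1), [round(x, 1) for x in limits])  # -7.0 [-48.3, 34.3]

# The Bland-Altman plot is then a scatter of `diffs` (y) against `means` (x),
# with horizontal lines at `bias` and at each limit of agreement.
```

Even with a bias near zero, wide limits of agreement flag clinically important disagreement between devices, which is exactly the pattern described next for the peak-flow meters.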
However, the Bland-Altman plot also shows that the difference in readings between the 2 meters can vary by up to 80 L/min in some subjects, and that the meters seem to perform differently for lower flow rates than they do for higher flow rates. The principal advantage of this method is that the observed disagreement data can be put into a clinical context. Would differences in measured peak expiratory flow rate of up to 80 L/min (especially in sicker patients with lower flow rates) affect patient management? A potential problem with Bland-Altman plots is that patterns can be obscured if the scale of the y axis is not carefully selected. The scale of the y axis needs to be appropriate for the concentration range of the x data (show absolute differences if the range is small but percentage or log-scale differences if the range is larger). 11

ANSWER 3

Q3. Cruz et al quote the oft-cited Landis and Koch article 2 stating that a κ of less than 0.2 represents poor agreement; 0.21 to 0.40, fair agreement; 0.41 to 0.60, moderate agreement; 0.61 to 0.80, good agreement; and 0.81 to 1.00, excellent agreement. Consider studies of the agreement of airline pilots deciding whether it is safe to land and psychologists deciding whether interviewees have type A or type B personalities. If the studies produced the same numeric κ value, would the adjectives assigned by Landis and Koch be equally appropriate?

Many investigators contrast their κ values with arbitrary guidelines originally proposed by Landis and Koch 2 and further popularized by Fleiss. 12 As we hope our example demonstrates, the mechanical mapping of numeric values of κ to the adjectives poor, fair, moderate, good, and excellent is fraught with problems. A κ of 0.75 might be good enough if the cost of being wrong is low (such as categorizing subjects into personality types), but nothing less than near-perfect agreement is requisite if the decision has important consequences.
We would not be pleased if our airplane's copilots attained a κ of 0.75 on "is it safe to land?" Some tests (eg, a set of historical questions that are used to identify patients at high risk for alcohol addiction) might be useful even if their results are only somewhat reliable. Other tests, however (eg, a set of history and physical examination data that are used to identify which patients with traumatic neck pain can safely forgo cervical spine radiography), will be useful only if they are highly reliable. This is because no poorly reliable test will ever be highly valid when used by multiple fallible observers. Conceptualizing any specific degree of agreement as poor, excellent, or anywhere in between regardless of the test's clinical context is, therefore, a dangerous oversimplification.

ANSWER 4

Q4.a Imagine 2 blindfolded, intelligent individuals who are sitting in distant corners of a room and listening to 100 easy true/false statements such as "red is a color," "2 + 2 = 5," etc, over a loudspeaker. Each indicates his or her choice by pressing a button in the left hand for false and in the right for true. Questions are not repeated and the respondents are expected to offer a response for each statement. Verify that if they agree on all 100 answers, percentage agreement is 100% and κ is 1.0, regardless of how many of the statements are true and how many are false. Now imagine that the testing site is under the final approach for a major airport, and, at times, noise from jets flying overhead drowns out the statements from the loudspeaker. When this occurs, respondents agree, on average, only half the time (as one would expect). Recalculate percentage agreement and κ for the same 100-statement test conducted under the following sets of conditions: (1) half the statements are true and 1% of the statements are rendered incomprehensible by the planes; (2) 90% of the statements are true and 1% of the statements are rendered incomprehensible by the planes; (3) half the statements are true and 20% of the statements are rendered incomprehensible by the planes; (4) 90% of the statements are true and 20% of the statements are rendered incomprehensible by the planes. Discuss the meaning of percentage agreement and κ in these 4 settings.

This example is designed to show that in certain situations, κ can underestimate actual agreement, particularly when there are skewed marginals and when percentage agreement is fairly high. Recall that κ = (% observed agreement - % expected agreement)/(100% - % expected agreement). Thus, when 2 raters agree on all observations, regardless of how these are distributed between true and false or what the expected agreements are, κ = 1. Tables 1 and 2 show 2 of the possible results under condition 1.
Note that we do not know whether the one incomprehensible statement was true or false or how each rater will classify it. We do know that because only 1 of the 100 observations was made with low confidence, all possible results will yield very similar percentage agreement and κ.

Tables 1 and 2

Tables 3 and 4 show 2 possible results under condition 2. The preponderance of true questions has skewed the marginal totals. Consequently, how each rater classifies the one unheard question affects κ a bit more than when the marginals are roughly equal.

Tables 3 and 4

Recognize that in these first 2 sets of conditions, the raters are asked to rate 99 easy statements and 1 hard (plane flying overhead) statement. The expected agreement should be the same in all 4 tables, yet the value of percentage agreement expected due to chance, as calculated for κ, changes. κ is lower in Table 4 than in Table 2, even though raters are performing equally well in both. Observed percentage agreement also varies, but only slightly.

Table 5 shows the most likely result under condition 3 (50% of statements are true and 20% are incomprehensible). This can be derived by considering the 80 high-confidence and 20 low-confidence classifications separately (Tables 6 and 7). The audible questions will result in Table 6. If each rater has no knowledge about how often inaudible statements are true, then the modal result for the 20 inaudible questions is that depicted in Table 7. Summing Tables 6 and 7 results in Table 5. Of course, by chance alone, the 20 unheard questions might result in a more skewed table, like either Table 8 (all observations falling into disagreement cells b or c) or Table 9 (all observations falling into agreement cells a or d). Summing these results with Table 6 yields Tables 10 and 11, respectively. Thus, depending on how the incomprehensible 20 questions end up being classified, percentage agreement and κ could range from 80% and κ = 0.6 (Table 10) to 100% and κ = 1 (Table 11).
Table 12 shows a possible result under condition 4 (90% of statements are true and 20% are incomprehensible), again derived by considering the 80 high-confidence and 20 low-confidence classifications separately. The audible questions will result in Table 13. If both raters believe that 90% of the unheard questions are true, those 20 questions might result in Table 14. κ is negative for this table because the observed agreement (80%) is less than that expected due to chance ((0.9 × 0.9) + (0.1 × 0.1) = 82%). Summing Tables 13 and 14 results in Table 12. The 20 unheard questions might again result in a more skewed table. It seems unlikely that either rater would believe that 90% of all the questions were true and still classify all of the unheard questions as false (Table 15), but we have no assurance that this could not happen. Both raters could simply classify all the unheard questions as true (Table 16). Thus, depending on how the incomprehensible 20 questions end up being classified, percentage agreement and κ could range from 80% (summing Tables 13 and 15) to 100% and κ = 1 (summing Tables 13 and 16).
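The 4 loudspeaker conditions can also be explored by simulation. The sketch below adopts one reading of the scenario as an assumption: for an inaudible statement, each rater independently guesses "true" with probability equal to the overall proportion of true statements. Other guessing rules (eg, always guessing 50/50) would give different κ ranges:

```python
import random

def simulate(p_true, p_inaudible, n=100, trials=2000, seed=0):
    """Average percentage agreement and kappa for the loudspeaker experiment.

    Audible statements are answered identically (and correctly) by both
    raters; for inaudible ones each rater independently guesses "true"
    with probability p_true.
    """
    rng = random.Random(seed)
    pa_sum = kappa_sum = 0.0
    for _ in range(trials):
        a = b = c = d = 0
        for _ in range(n):
            truth = rng.random() < p_true
            if rng.random() < p_inaudible:
                r1 = rng.random() < p_true  # each rater guesses independently
                r2 = rng.random() < p_true
            else:
                r1 = r2 = truth             # both hear and answer correctly
            if r1 and r2:
                a += 1
            elif r1:
                b += 1
            elif r2:
                c += 1
            else:
                d += 1
        p_obs = (a + d) / n
        p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / (n * n)
        pa_sum += p_obs
        kappa_sum += (p_obs - p_exp) / (1 - p_exp) if p_exp < 1 else 1.0
    return pa_sum / trials, kappa_sum / trials

for p_true, p_inaud in [(0.5, 0.01), (0.9, 0.01), (0.5, 0.2), (0.9, 0.2)]:
    pa, k = simulate(p_true, p_inaud)
    print(f"{p_true:.0%} true, {p_inaud:.0%} inaudible: "
          f"agreement {pa:.2f}, kappa {k:.2f}")
```

Running this reproduces the qualitative pattern in the text: equally skilled raters earn a noticeably lower average κ when the true/false split is 90/10 than when it is 50/50, even though their percentage agreement barely changes.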

Tables 5, 6, and 7
Tables 8, 9, 10, and 11
Tables 12, 13, and 14
Tables 15, 16, 17, and 18

We summarize the range of percentage agreement and κ results for the 4 scenarios (range of % agreement, range of κ):

Percentage of T/F Statements   1% Inaudible              20% Inaudible
50/50                          99%-100%, ≈0.98-1.0       80%-100%, 0.6-1.0
90/10                          99%-100%, ≈0.95-1.0       80%-100%, ≈0.4-1.0

We remind readers that the skill of the raters is the same within each column; when raters can hear the statement, they always agree. Despite this, simply varying the percentage of true statements can result in a wide range of κ, even though agreement should be identical. Also, the range of κ widens as the proportion of low-confidence classifications increases. The implicit assumption of the κ statistic is that expected agreement due to chance, as calculated from the marginal totals, is an unbiased estimate of actual chance agreement. Our examples illustrate that that assumption is often false and can lead to κ values that underestimate actual agreement, particularly when agreement is fairly high. 13,14

Q4.b Consider the 2 tables below and calculate percentage agreement and κ for each. Why is κ lower on the right? What does this mean?

This question reinforces the concepts illustrated in question 4a. The percentage agreement is the same in both of these tables, but κ is lower in the right table because its marginal totals are skewed. κ multiplies the (a + b) and (a + c) marginals together (and then sums this with the product of the (c + d) and (b + d) marginals) to determine agreement expected by chance. Calculated expected agreement is higher when marginals are skewed compared with when the marginals are equal.

κ is lower when the marginals are skewed because of how κ defines chance agreement. κ assumes that when there are skewed marginal totals, there is likely to be more chance agreement. In certain situations, this makes sense. If 2 blindfolded independent raters say yes or no on each of 100 spins of a fair roulette wheel with 90% of slots marked yes and 10% marked no, they will likely have marginals of 90 and 10, and we would expect their agreement due to chance to be 82% (0.9 × 0.9 + 0.1 × 0.1 = 0.82). They certainly would have to do far better than 82% for us to start wondering about clairvoyance or cheating. But if those same raters are told that the roulette wheel could be marked any way, from all slots being yes to all slots being no, then we would expect our raters to have marginals of 0.5 and 0.5 and to agree by chance 50% of the time (0.5 × 0.5 + 0.5 × 0.5 = 0.5). κ assumes that skewed marginals always indicate agreement due to chance; but there are many situations (as in the examples in question 4a) in which skewed marginals result from the raters making high-quality judgments. We find it particularly irksome when κ is used to measure the interrater reliability of inanimate mechanical devices. Imagine 2 bedside pregnancy tests that produce a positive or negative result based on some form of immunoassay. The tests do not know what the correct marginals should be, so any deviation of the marginals cannot be assumed to produce chance agreement in the inner cells. Certainly if both tests were wholly invalid (there was no reagent on the cards, so they read negative every time), then there would be 100% agreement, but this would not be chance agreement. For tests that do not know what the right marginals are, percentage agreement is a simple and sufficient summary statistic, although, as noted below, reporting the actual 2 × 2 table is the best method of communicating reliability results.
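The roulette-wheel argument can be checked with a quick simulation (a sketch, not part of the article): two blindfolded "raters" each call yes at the 90% base rate, so all of their agreement is chance.

```python
import random

random.seed(0)
n = 100_000
p_yes = 0.9  # 90% of the wheel's slots are marked "yes"

# Each rater independently calls yes/no at the base rate; count agreements.
agree = sum(
    (random.random() < p_yes) == (random.random() < p_yes)
    for _ in range(n)
)
print(agree / n)  # close to 0.9 * 0.9 + 0.1 * 0.1 = 0.82
```

Setting `p_yes = 0.5` instead reproduces the unmarked-wheel case, where chance agreement falls to about 50%.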
Q4.c To further consider the meaning of κ, imagine (for the right-hand panel above) that planes flew overhead such that 60 statements were heard perfectly and 40 were barely comprehensible or not heard at all. Below are separate tables for the 60 audible and 40 incomprehensible statements. Calculate percentage agreement and κ for these tables. Which is the better measure for each? Consider the confidence level of the raters in the different scenarios presented in this exercise. Should rater confidence be considered when interrater reliability is described? How might this be done?

κ fails to distinguish between 2 phenomena. In the first, raters make ratings with high confidence. This produces a high percentage agreement. Depending on whether the values of what they are rating are evenly distributed (a coin toss) or skewed (a dice roll in which 1 is true and the other 5 numbers are false), the marginals will be even or skewed. Note that in this example, the confidence of the raters determines the values of the inner cells, and the marginals reflect the values of those inner cells. In the alternate phenomenon, raters have low confidence. Because they do not know what to make of individual observations (akin to having to say true or false when they cannot hear the statement because of the airplanes), they rely on their knowledge of the marginals to guide their choices. As a result, it is the value of the marginals that determines the value of the inner cells, and agreement is largely mediated by chance. The problem is that we typically cannot tell which phenomenon is occurring. We get to see one table, and we have no way to break it down into high-confidence and low-confidence subtables. The left table has perfect agreement, and the margins here are highly skewed. Percentage agreement and κ approach 100% and 1, respectively, and the choice of statistic makes little difference. In analyzing the right-hand table, we encounter the same problems discussed in question 4b.
These raters had no ability to discern the individual statements and presumably guessed based on their knowledge of the marginals (ie, their experience with the audible statements). Observed agreement here is thus highly subject to chance occurrence based on the marginal values, and thus κ is an appropriate statistic for this table.

ANSWER 5

Q5. Finally, the following graph shows percentage agreement versus κ for the first 50 items in Table 1 of Cruz et al. The points are shaded to indicate how many subjects fall into the smallest cell in the 2 × 2 table.

[Figure: κ versus percentage agreement, stratified by size of the smallest cell, for 54 historical items from Table 1 of Cruz et al (n = 143).]

Q5.a Four lines in the table are denoted with square markers (near the arrow) on the graph (Is pain burning? Does it radiate to the back? Does it radiate to the jaw? Does it radiate to the left arm?). Create (approximate) 2 × 2 tables for these 4 points. Can you explain why these tables have similar percentage agreement but varying κs? Which do you believe is the better measure? Why do the κs differ?

These tables all show similar percentage agreement. Each successive table has an increasing κ because its marginals are less skewed than those of the previous table. Compare the sums of the a+b and a+c marginals with the sums of the c+d and b+d marginals for each table: 263 and 23, 259 and 27, 250 and 36, 228 and 58. In each successive table, these 2 numbers get closer together (less skewed). κ uses these marginal values to calculate percentage agreement due to chance. The best metric of reliability is the one that summarizes the raw reliability data in the most useful way. Percentage agreement is easy to understand, but it is limited in an important way: it does not account for agreement that occurred simply by chance. How, then, should we think of and define chance agreement? Like many classic statistical techniques, κ makes a rigid and narrow assumption (that all observations are independent, identically distributed, and drawn from the same probability density function) to use the observed marginal values to calculate chance agreement. It follows, then, that if each physician and research assistant asked these 4 questions, but overflying airplanes prevented any of them from hearing the responses clearly, and they had no previous information about the quality or radiation of chest pain that might guide their guesses toward some target marginals, then the assumptions of κ are reasonably valid and we could infer that the question Does it radiate to the left arm? performed better than Is pain burning? because the first showed less chance agreement (despite equal raw agreement). It could be more meaningful and useful, however, to think of and define chance agreement as that which occurs under conditions of low confidence. Imagine that resident physicians assess 100 patients with pericardial effusion for ultrasonographic evidence of right ventricular diastolic collapse.
Each rater makes a dichotomous (yes/no) assessment of right ventricular collapse for each patient and also records his or her subjective confidence in that assessment. Confidence can be assessed categorically (high versus low; high/medium/low; quartiles) or continuously (mark a point on a confidence line with anchors not at all confident and extremely confident). If confidence were converted to a number ranging between 0 and 100, corresponding to where the users put a mark on a continuous confidence line, study data might result in:

Patient #   Rater A   RAC (0-100)   Rater B   RBC (0-100)
1           Yes       42            Yes       96
2           No        12            Yes       28
3           Yes       65            No        70
4           No        90            No        …
…
            No        62            No        48

RAC, Rater A confidence; RBC, Rater B confidence.

The numbers in these tables were deliberately chosen to illustrate a point: although the overall table suggests high agreement (and a high κ) for these ultrasonographers, the stratified data tell a different story. Most of the agreement here occurred with observations for which one or both raters reported low confidence (analogous to answering a simple true/false question that was obscured by an overflying airplane). It seems intuitive to assume that some chance agreement occurs under conditions of uncertainty. In contrast, there was only 43% agreement for the high-confidence assessments (analogous to answering Does 2 + 2 = 4? with no airplanes nearby). If assessments are truly being made with high confidence, should any part of that be attributed to chance? An assessment of rater confidence has several obvious and important limitations. Different raters likely assess their own confidence in markedly different ways and may tend to cluster their confidence assessments in a particular range. There is also no single criterion standard to assess the validity of confidence ratings. It may be that many of the ultrasonographers in this

example were discerning RV function very accurately but just did not feel confident in their assessments (perhaps because of lack of experience). Despite these limitations, assessing the subjective confidence of ratings may be a more useful approach to defining and accounting for the effect of chance agreement than κ.

Q5.b Can you comment on the relationship between the size of the smallest cell in the 2 × 2 table and the extent to which κ may deviate from percentage agreement?

The graphic shows that κ is most likely to deviate from a linear relationship with percentage agreement when (1) percentage agreement is high and (2) there is at least 1 cell with a small N. The presence of a cell with sparse data suggests that the marginals are skewed, and, as discussed in the answer to 5a, as marginals become skewed, expected agreement increases and (for a given percentage agreement) κ decreases. This graphic shows that the variation in κ among the questions in Table 1 of the Cruz et al article may have less to do with the difficulty of the question than with the rarity of yes (or no) answers. Those questions for which the majority of respondents provide the same answer are likely to have lower κs, regardless of the true reliability of the measure.

Q5.c Given the problems with both percentage agreement and κ illustrated in these examples, do you think it would be better if investigators presented the 4 numbers in the inner cells of each 2 × 2 table, instead of reporting the percentage agreement or κ?

We hope that this Journal Club has made readers aware of the oversimplifications and distortions that can occur when a 2 × 2 table (or more complex data structure) is reduced to a single reliability metric such as κ. If an experimental design warrants consideration of interrater reliability, then investigators should strongly consider reporting the actual interrater reliability data rather than percentage agreement or κ.
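A confidence-stratified analysis of the kind imagined in answer 5a might be sketched as follows. The ratings and the 70-point cutoff below are invented for illustration; they are not the article's data, but they are built to echo its point that overall agreement can be driven by low-confidence observations.

```python
# Invented (rating_a, confidence_a, rating_b, confidence_b) tuples, built so
# that most agreement comes from low-confidence observations.
ratings = [
    ("yes", 95, "no", 90),   # high confidence, disagree
    ("no", 85, "yes", 92),   # high confidence, disagree
    ("yes", 88, "yes", 91),  # high confidence, agree
    ("yes", 30, "yes", 20),  # low confidence, agree
    ("no", 15, "no", 40),    # low confidence, agree
    ("no", 25, "no", 10),    # low confidence, agree
    ("yes", 45, "yes", 35),  # low confidence, agree
    ("no", 20, "no", 55),    # low confidence, agree
]

CUTOFF = 70  # hypothetical threshold: both raters at/above it = high confidence

def agreement(pairs):
    return sum(a == b for a, _, b, _ in pairs) / len(pairs)

def is_high_confidence(row):
    return row[1] >= CUTOFF and row[3] >= CUTOFF

high = [r for r in ratings if is_high_confidence(r)]
low = [r for r in ratings if not is_high_confidence(r)]

# Overall agreement looks respectable, but the high-confidence subset
# agrees only a third of the time.
print(agreement(ratings))  # 0.75
print(agreement(high))
print(agreement(low))      # 1.0
```

Reporting these strata separately, rather than a single pooled κ, is one concrete way to act on the suggestion above.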
This information could go in an online-only supplement if it is too bulky to go in the main article.

Section editors: Tyler W. Barrett, MD; David L. Schriger, MD, MPH

REFERENCES
1. Cruz CO, Meshberg EG, Shofer FS, et al. Interrater reliability and accuracy of clinicians and trained research assistants performing prospective data collection in emergency department patients with potential acute coronary syndrome. Ann Emerg Med. 2009;54.
2. Landis JR, Koch GC. The measurement of observer agreement for categorical data. Biometrics. 1977;33.
3. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Measure. 1960;20.
4. Uebersax J. Statistical methods for rater agreement. Available at: htm. Accessed May 31.
5. Wuensch KL. Inter-rater agreement. Accessed May 18.
6. Kendall M. A new measure of rank correlation. Biometrika. 1938;30.
7. Spearman C. The proof and measurement of association between two things. Am J Psychol. 1904;15.
8. Noether GE. Why Kendall tau? Available at: ts/bts/noether/text.html. Accessed May 18.
9. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86.
10. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;1.
11. Dewitte K, Fierens C, Stockl D, et al. Application of the Bland-Altman plot for interpretation of method-comparison studies: a critical investigation of its practice. Clin Chem. 2002;48.
12. Fleiss JL. Statistical Methods for Rates and Proportions. 2nd ed. New York, NY: John Wiley & Sons.
13. Feinstein AR, Cicchetti DV. High agreement but low kappa, I: the problems of two paradoxes. J Clin Epidemiol. 1990;43.
14. Cicchetti DV, Feinstein AR. High agreement but low kappa, II: resolving the paradoxes. J Clin Epidemiol. 1990;43.


More information

Assessing Agreement Between Methods Of Clinical Measurement

Assessing Agreement Between Methods Of Clinical Measurement University of York Department of Health Sciences Measuring Health and Disease Assessing Agreement Between Methods Of Clinical Measurement Based on Bland JM, Altman DG. (1986). Statistical methods for assessing

More information

DATA is derived either through. Self-Report Observation Measurement

DATA is derived either through. Self-Report Observation Measurement Data Management DATA is derived either through Self-Report Observation Measurement QUESTION ANSWER DATA DATA may be from Structured or Unstructured questions? Quantitative or Qualitative? Numerical or

More information

Chapter 1: Exploring Data

Chapter 1: Exploring Data Chapter 1: Exploring Data Key Vocabulary:! individual! variable! frequency table! relative frequency table! distribution! pie chart! bar graph! two-way table! marginal distributions! conditional distributions!

More information

Chapter 7: Descriptive Statistics

Chapter 7: Descriptive Statistics Chapter Overview Chapter 7 provides an introduction to basic strategies for describing groups statistically. Statistical concepts around normal distributions are discussed. The statistical procedures of

More information

Discrimination Weighting on a Multiple Choice Exam

Discrimination Weighting on a Multiple Choice Exam Proceedings of the Iowa Academy of Science Volume 75 Annual Issue Article 44 1968 Discrimination Weighting on a Multiple Choice Exam Timothy J. Gannon Loras College Thomas Sannito Loras College Copyright

More information

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES Correlational Research Correlational Designs Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are

More information

VARIABLES AND MEASUREMENT

VARIABLES AND MEASUREMENT ARTHUR SYC 204 (EXERIMENTAL SYCHOLOGY) 16A LECTURE NOTES [01/29/16] VARIABLES AND MEASUREMENT AGE 1 Topic #3 VARIABLES AND MEASUREMENT VARIABLES Some definitions of variables include the following: 1.

More information

Lessons in biostatistics

Lessons in biostatistics Lessons in biostatistics The test of independence Mary L. McHugh Department of Nursing, School of Health and Human Services, National University, Aero Court, San Diego, California, USA Corresponding author:

More information

Collecting & Making Sense of

Collecting & Making Sense of Collecting & Making Sense of Quantitative Data Deborah Eldredge, PhD, RN Director, Quality, Research & Magnet Recognition i Oregon Health & Science University Margo A. Halm, RN, PhD, ACNS-BC, FAHA Director,

More information

HW 1 - Bus Stat. Student:

HW 1 - Bus Stat. Student: HW 1 - Bus Stat Student: 1. An identification of police officers by rank would represent a(n) level of measurement. A. Nominative C. Interval D. Ratio 2. A(n) variable is a qualitative variable such that

More information

WDHS Curriculum Map Probability and Statistics. What is Statistics and how does it relate to you?

WDHS Curriculum Map Probability and Statistics. What is Statistics and how does it relate to you? WDHS Curriculum Map Probability and Statistics Time Interval/ Unit 1: Introduction to Statistics 1.1-1.3 2 weeks S-IC-1: Understand statistics as a process for making inferences about population parameters

More information

How to interpret scientific & statistical graphs

How to interpret scientific & statistical graphs How to interpret scientific & statistical graphs Theresa A Scott, MS Department of Biostatistics theresa.scott@vanderbilt.edu http://biostat.mc.vanderbilt.edu/theresascott 1 A brief introduction Graphics:

More information

2.75: 84% 2.5: 80% 2.25: 78% 2: 74% 1.75: 70% 1.5: 66% 1.25: 64% 1.0: 60% 0.5: 50% 0.25: 25% 0: 0%

2.75: 84% 2.5: 80% 2.25: 78% 2: 74% 1.75: 70% 1.5: 66% 1.25: 64% 1.0: 60% 0.5: 50% 0.25: 25% 0: 0% Capstone Test (will consist of FOUR quizzes and the FINAL test grade will be an average of the four quizzes). Capstone #1: Review of Chapters 1-3 Capstone #2: Review of Chapter 4 Capstone #3: Review of

More information

Comparison of the Null Distributions of

Comparison of the Null Distributions of Comparison of the Null Distributions of Weighted Kappa and the C Ordinal Statistic Domenic V. Cicchetti West Haven VA Hospital and Yale University Joseph L. Fleiss Columbia University It frequently occurs

More information

Section 3.2 Least-Squares Regression

Section 3.2 Least-Squares Regression Section 3.2 Least-Squares Regression Linear relationships between two quantitative variables are pretty common and easy to understand. Correlation measures the direction and strength of these relationships.

More information

4 Diagnostic Tests and Measures of Agreement

4 Diagnostic Tests and Measures of Agreement 4 Diagnostic Tests and Measures of Agreement Diagnostic tests may be used for diagnosis of disease or for screening purposes. Some tests are more effective than others, so we need to be able to measure

More information

The recommended method for diagnosing sleep

The recommended method for diagnosing sleep reviews Measuring Agreement Between Diagnostic Devices* W. Ward Flemons, MD; and Michael R. Littner, MD, FCCP There is growing interest in using portable monitoring for investigating patients with suspected

More information

Student Performance Q&A:

Student Performance Q&A: Student Performance Q&A: 2009 AP Statistics Free-Response Questions The following comments on the 2009 free-response questions for AP Statistics were written by the Chief Reader, Christine Franklin of

More information

Reliability and Validity checks S-005

Reliability and Validity checks S-005 Reliability and Validity checks S-005 Checking on reliability of the data we collect Compare over time (test-retest) Item analysis Internal consistency Inter-rater agreement Compare over time Test-Retest

More information

Using Differential Item Functioning to Test for Inter-rater Reliability in Constructed Response Items

Using Differential Item Functioning to Test for Inter-rater Reliability in Constructed Response Items University of Wisconsin Milwaukee UWM Digital Commons Theses and Dissertations May 215 Using Differential Item Functioning to Test for Inter-rater Reliability in Constructed Response Items Tamara Beth

More information

Measures. David Black, Ph.D. Pediatric and Developmental. Introduction to the Principles and Practice of Clinical Research

Measures. David Black, Ph.D. Pediatric and Developmental. Introduction to the Principles and Practice of Clinical Research Introduction to the Principles and Practice of Clinical Research Measures David Black, Ph.D. Pediatric and Developmental Neuroscience, NIMH With thanks to Audrey Thurm Daniel Pine With thanks to Audrey

More information

Descriptive Statistics Lecture

Descriptive Statistics Lecture Definitions: Lecture Psychology 280 Orange Coast College 2/1/2006 Statistics have been defined as a collection of methods for planning experiments, obtaining data, and then analyzing, interpreting and

More information

Section 6: Analysing Relationships Between Variables

Section 6: Analysing Relationships Between Variables 6. 1 Analysing Relationships Between Variables Section 6: Analysing Relationships Between Variables Choosing a Technique The Crosstabs Procedure The Chi Square Test The Means Procedure The Correlations

More information

CCM6+7+ Unit 12 Data Collection and Analysis

CCM6+7+ Unit 12 Data Collection and Analysis Page 1 CCM6+7+ Unit 12 Packet: Statistics and Data Analysis CCM6+7+ Unit 12 Data Collection and Analysis Big Ideas Page(s) What is data/statistics? 2-4 Measures of Reliability and Variability: Sampling,

More information

Review Statistics review 2: Samples and populations Elise Whitley* and Jonathan Ball

Review Statistics review 2: Samples and populations Elise Whitley* and Jonathan Ball Available online http://ccforum.com/content/6/2/143 Review Statistics review 2: Samples and populations Elise Whitley* and Jonathan Ball *Lecturer in Medical Statistics, University of Bristol, UK Lecturer

More information

Technical Specifications

Technical Specifications Technical Specifications In order to provide summary information across a set of exercises, all tests must employ some form of scoring models. The most familiar of these scoring models is the one typically

More information

Measurement and Descriptive Statistics. Katie Rommel-Esham Education 604

Measurement and Descriptive Statistics. Katie Rommel-Esham Education 604 Measurement and Descriptive Statistics Katie Rommel-Esham Education 604 Frequency Distributions Frequency table # grad courses taken f 3 or fewer 5 4-6 3 7-9 2 10 or more 4 Pictorial Representations Frequency

More information

Pooling Subjective Confidence Intervals

Pooling Subjective Confidence Intervals Spring, 1999 1 Administrative Things Pooling Subjective Confidence Intervals Assignment 7 due Friday You should consider only two indices, the S&P and the Nikkei. Sorry for causing the confusion. Reading

More information

1 The conceptual underpinnings of statistical power

1 The conceptual underpinnings of statistical power 1 The conceptual underpinnings of statistical power The importance of statistical power As currently practiced in the social and health sciences, inferential statistics rest solidly upon two pillars: statistical

More information

SUPPLEMENTAL MATERIAL

SUPPLEMENTAL MATERIAL 1 SUPPLEMENTAL MATERIAL Response time and signal detection time distributions SM Fig. 1. Correct response time (thick solid green curve) and error response time densities (dashed red curve), averaged across

More information

Validity and reliability of measurements

Validity and reliability of measurements Validity and reliability of measurements 2 Validity and reliability of measurements 4 5 Components in a dataset Why bother (examples from research) What is reliability? What is validity? How should I treat

More information

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments Greg Pope, Analytics and Psychometrics Manager 2008 Users Conference San Antonio Introduction and purpose of this session

More information

Lesson 1: Distributions and Their Shapes

Lesson 1: Distributions and Their Shapes Lesson 1 Name Date Lesson 1: Distributions and Their Shapes 1. Sam said that a typical flight delay for the sixty BigAir flights was approximately one hour. Do you agree? Why or why not? 2. Sam said that

More information

7/17/2013. Evaluation of Diagnostic Tests July 22, 2013 Introduction to Clinical Research: A Two week Intensive Course

7/17/2013. Evaluation of Diagnostic Tests July 22, 2013 Introduction to Clinical Research: A Two week Intensive Course Evaluation of Diagnostic Tests July 22, 2013 Introduction to Clinical Research: A Two week Intensive Course David W. Dowdy, MD, PhD Department of Epidemiology Johns Hopkins Bloomberg School of Public Health

More information

Reliability, validity, and all that jazz

Reliability, validity, and all that jazz Reliability, validity, and all that jazz Dylan Wiliam King s College London Introduction No measuring instrument is perfect. The most obvious problems relate to reliability. If we use a thermometer to

More information

CHAPTER 3 RESEARCH METHODOLOGY

CHAPTER 3 RESEARCH METHODOLOGY CHAPTER 3 RESEARCH METHODOLOGY 3.1 Introduction 3.1 Methodology 3.1.1 Research Design 3.1. Research Framework Design 3.1.3 Research Instrument 3.1.4 Validity of Questionnaire 3.1.5 Statistical Measurement

More information

How to assess the strength of relationships

How to assess the strength of relationships Publishing Date: April 1994. 1994. All rights reserved. Copyright rests with the author. No part of this article may be reproduced without written permission from the author. Meta Analysis 3 How to assess

More information

Lecture Outline. Biost 517 Applied Biostatistics I. Purpose of Descriptive Statistics. Purpose of Descriptive Statistics

Lecture Outline. Biost 517 Applied Biostatistics I. Purpose of Descriptive Statistics. Purpose of Descriptive Statistics Biost 517 Applied Biostatistics I Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics University of Washington Lecture 3: Overview of Descriptive Statistics October 3, 2005 Lecture Outline Purpose

More information

Lecturer: Rob van der Willigen 11/9/08

Lecturer: Rob van der Willigen 11/9/08 Auditory Perception - Detection versus Discrimination - Localization versus Discrimination - - Electrophysiological Measurements Psychophysical Measurements Three Approaches to Researching Audition physiology

More information