Evaluation of a clinical test. I: Assessment of reliability


British Journal of Obstetrics and Gynaecology, June 2001, Vol. 108, pp. 562-567

COMMENTARY

Introduction

Testing and screening are critical parts of the clinical process, since inappropriate testing strategies put patients at risk and entail a serious waste of resources [1,2]. Based on our recent experiences of evaluating the diagnostic literature [3-7], we have come to believe that there is much misunderstanding about the evaluation of clinical tests. Some tests, introduced into practice without proper evaluation, are so inefficient as to be almost useless. In our view, the absence of clear methodological guidelines about the evaluation of clinical tests is a major impediment. Just as robust research methods for assessing the effectiveness of treatments have been actively pursued over the last decade, so attention needs to be focused on how research on diagnostic tests and their impact on clinical practice might be improved. Our commentary is prompted by the concern that there is a huge disparity between the number of clinical tests and the availability of robust research evidence to help make decisions about their most appropriate clinical application.

We must first ask why inefficiency in clinical testing leads to mismanagement of patients. The answer is quite simple. If a diagnosis is missed, early therapy cannot be undertaken, thereby prolonging morbidity. On the other hand, if a diagnosis is made in the absence of disease, unnecessary therapy may be undertaken, with the risk of adverse effects. But how does inefficiency in clinical testing arise in the first place? We need to understand that the results of our tests are the outcomes of clinical measurements. It is the errors in these clinical measurements that lead to inefficiency in clinical testing.

Errors in clinical measurements [8-10] are of two sorts. Firstly, measurement may be inconsistent, if the same attribute recorded by another observer (or recorded a second time by the same observer) leads to a different reading. The term reliability refers to this type of measurement error. Secondly, the measurement obtained may not be accurate when compared with the 'true' state of the attribute estimated by a suitable reference standard. This type of measurement error is referred to as validity. The goal of research is to determine whether a clinical test measures what is intended (validity), but first it should be established that it measures something in a consistent fashion (reliability). Based on these two types of errors in clinical measurement, our commentary is divided into two parts. In the first part, the focus is on appropriate strategies for conducting and analysing studies of the reliability of a clinical test. In the second part, strategies for conducting and analysing studies of the validity of a clinical test will be described.

Design of a study of reliability

Reliability studies are generally reported in the literature as observer variability studies. The study is designed to compare measurements obtained by two or more observers (inter-rater reliability) or by one observer on two or more different occasions (intra-rater reliability). Intra-rater reliability is a prerequisite for inter-rater reliability [8]. We will restrict our description to inter-rater reliability.
The objective of the study is to measure the same clinical attribute independently on at least two occasions and then to determine the agreement between these measurements. In order for the reliability of a test to be replicated from a published study, researchers should provide sufficient information about the manner in which the test was conducted [9]. The information should cover all important issues with regard to the conduct of the test, such as the preparation of the patients, measurement of biophysical recordings, details of laboratory assays, and computation of results.

In studies of reliability, one possible source of bias is the use of measurements from a sample which is not representative of the population being studied [9]. Reliability will appear falsely optimistic if the researchers have deliberately discarded difficult or borderline cases from the study. Such omissions are more likely to occur with convenience or arbitrary methods of sampling. Selection bias is less likely with consecutive or random sampling [6]. Sadly, sampling is inadequate in many studies [4,6,7].

Studies of reliability also require the observers to be blinded to one another's measurements. Blind recording of measurements avoids bias, since the recordings made by one observer are not influenced by knowledge of the measurements obtained by other observers; however, blinding is often absent in studies of reliability. In our systematic review of the reliability of bladder volume measurement by ultrasound, we found that blinding of the ultrasonic bladder volume measurements was adequate in only 40% of the studies [5]. Unless the sample is representative of the population being studied, and unless the recordings of the observers are not made available to one another, we can have no confidence in a study of the reliability of a clinical test.

The precise estimation of the reliability of a test requires an adequate sample size. The calculation of the sample size for studies of reliability can be quite complex. Although methods for the estimation of the appropriate sample size for studies of reliability are available, such calculations are seldom performed.

Data analysis of a study of reliability

Table 1 shows the different types of measurement encountered in clinical practice, with some examples: nominal (dichotomous); ordinal (ranked); and dimensional (continuous). The important point is that in studies of the reliability of a clinical test, the measurements recorded by the two observers should be expressed on the same type of scale, and with the same number of categories if the data are ordinal. It is important to remember that the purpose of a study of reliability is to determine the agreement (or concordance) of the measurements obtained by two observers, measuring the same clinical attribute independently [8].

Table 1. Some types of measurement scales and examples of different reliability studies.

  Scale        Description                       Example
  Nominal      Non-ranked categories             Presence or absence of hypertension (based on a particular cut-off level of blood pressure)
  Ordinal      Three or more ranked categories   None, mild, moderate or severe hypertension (based on categories of a range of normal, low high, medium high and very high values of blood pressure)
  Dimensional  Continuous or decimal scale       Exact blood pressure values expressed in mmHg

Nominal scale

When dealing with dichotomous data (for example, the presence or absence of hypertension), many researchers will report the percentage agreement as the index of reliability. From the hypothetical example in Table 2, the percentage agreement between the two midwives recording whether pregnant women are hypertensive or normotensive is 91.3%, a statistic that looks impressive because of its closeness to 100% (the value depicting perfect agreement). However, this statistic does not take into account the agreement that would be expected to occur by chance alone. In the lower part of Table 2, we have calculated the chance-expected percentage agreement. It is fairly close to the observed percentage agreement, such that one may conclude that the agreement beyond chance is not very great.

Table 2. Agreement (disagreement) between two midwives recording mid-trimester diastolic blood pressure in a high-risk antenatal clinic and classifying it as normal or hypertension based on a cut-off level of 90 mmHg.

                              Midwife A
                   Hypertension   Normal      Total
  Midwife B
    Hypertension     10 (a)        10 (b)      20 (R1)
    Normal           10 (c)       200 (d)     210 (R2)
    Total            20 (C1)      210 (C2)    230 (N)

  Prevalence of hypertension = (R1 or C1)/N = 20/230 = 8.7%
  Observed percentage agreement = [(a + d)/N] x 100 = [(10 + 200)/230] x 100 = 91.3%
  Chance-expected percentage agreement = {[(R1 x C1)/N] + [(R2 x C2)/N]} x 100/N
    = [(R1 x C1) + (R2 x C2)] x 100/N^2 = [(20 x 20) + (210 x 210)] x 100/230^2 = 84.1%
  Kappa coefficient (k) = (observed - chance-expected percentage agreement) / (perfect - chance-expected percentage agreement)
    = (91.3 - 84.1)/(100 - 84.1) = 0.45 (95% CI 0.32-0.58)

The statistic of choice for estimating the agreement between observers using the same nominal or dichotomous scale of measurement is kappa [11], which corrects for the agreement expected by chance.
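For readers who wish to check the arithmetic in Table 2 for themselves, a minimal Python sketch is given below. It is ours, not part of the original commentary; the function name and layout are illustrative only. Kappa is defined formally in the next paragraph.

```python
# Minimal sketch (not from the paper): Cohen's kappa for a 2 x 2 agreement table.
# Cells follow Table 2: a, b = Midwife B "hypertension" row; c, d = Midwife B
# "normal" row; the columns are Midwife A's classifications.

def cohen_kappa_2x2(a, b, c, d):
    n = a + b + c + d
    p_o = (a + d) / n                      # observed agreement
    r1, r2 = a + b, c + d                  # row totals (Midwife B)
    c1, c2 = a + c, b + d                  # column totals (Midwife A)
    p_e = (r1 * c1 + r2 * c2) / n ** 2     # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)

# Table 2: 10 agreed hypertensive, 200 agreed normal, 10 + 10 disagreements
print(round(cohen_kappa_2x2(10, 10, 10, 200), 2))   # 0.45
```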

Kappa is the observed agreement minus the agreement expected by chance, divided by perfect agreement minus the agreement expected by chance [9]:

  kappa = (Po - Pe) / (1 - Pe)

where Po is the observed agreement and Pe is the agreement expected by chance. Kappa therefore gives more information than simple percentage agreement. Its values range from 0 to 1, with 0 representing no agreement beyond chance and 1 representing perfect agreement. The standard error of kappa allows us to estimate its statistical significance and also its 95% confidence interval. These computations can be performed using a computer [12]. The magnitude of kappa is a far more important measure of agreement than its statistical significance. The guidelines [13] for interpretation of the values of kappa are given in Table 3.

Table 3. Guidelines for interpretation of the kappa statistic [13].

  Kappa value   Strength of agreement
  < 0           Poor
  0-0.20        Slight
  0.21-0.40     Fair
  0.41-0.60     Moderate
  0.61-0.80     Substantial
  0.81-1.0      Excellent

Using the example of the agreement between the blood pressure recordings of the two midwives in Table 2, the kappa value obtained is 0.45 (95% CI 0.32-0.58), indicating moderate agreement. We should note that the interpretation of kappa is subjective, and the values of kappa in Table 3 are considered to be optimistic by some investigators [14,15].

The value of kappa depends on the prevalence of the disorder being studied. Suppose the study in Table 2 was repeated in a different population, where the prevalence of hypertension was 40% (Table 4A). The observer agreement was found to be lower (kappa = 0.17). This is because a high prevalence of hypertension results in a high level of chance-expected agreement and hence a lower kappa value; conversely, a condition with a low prevalence will tend to give higher values of kappa [16]. Therefore, kappa values generated from studies on disparate populations are not easily comparable [14].

When there is a systematic difference between the two midwives in recording the presence or absence of hypertension, higher than expected values of the kappa statistic can also be obtained [14]. In the above two examples, the prevalences of hypertension diagnosed by both midwives were identical (9% and 40% respectively). Let us assume instead that the prevalence of hypertension diagnosed by Midwife A was 7% and the prevalence diagnosed by Midwife B was 11% (Table 4B). The kappa value is 0.46, which is almost identical to that obtained in the first study and better than that obtained in the second study, despite the systematic disagreement in the diagnosis of hypertension between the two midwives. The examples in Table 4 illustrate the paradoxes of kappa [17]: in Table 4A the agreement is moderate despite the lower value of kappa, and in Table 4B the agreement is poor despite the higher value of kappa. McNemar's test will estimate the probability that the difference in the number of disagreements between the two observers could have occurred by chance [14]. For the information in Table 4B, P = 0.04, suggesting that there is systematic disagreement between the two observers. McNemar's test is available in electronic statistical packages [12].

Table 4. Effect of different sample recruitment strategies and of a systematic difference in the diagnosis of hypertension on the kappa (k) statistic for antenatal blood pressure measurement by two midwives.

A. Study 2

                              Midwife A
                   Hypertension   Normal    Total
  Midwife B
    Hypertension     10             10        20
    Normal           10             20        30
    Total            20             30        50

  Prevalence of hypertension = 40.0%
  Kappa coefficient (k) = 0.17 (95% CI 0-0.44)

B. Study 3

                              Midwife A
                   Hypertension   Normal    Total
  Midwife B
    Hypertension     10             15        25
    Normal            5            200       205
    Total            15            215       230

  Prevalence of hypertension (Midwife A) = 6.5% (under-diagnosis)
  Prevalence of hypertension (Midwife B) = 10.9% (over-diagnosis)
  Kappa coefficient (k) = 0.46 (95% CI 0.33-0.58)
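The P value quoted above for Table 4B can be reproduced from the two discordant cells alone. The sketch below is ours, not the authors' calculation; it uses the exact binomial form of McNemar's test, which for these counts gives the same value to two decimal places.

```python
# Minimal sketch (ours): exact McNemar test on the discordant pairs of Table 4B.
# b = women called hypertensive by Midwife B but normal by Midwife A (15);
# c = the reverse disagreement (5). Only these discordant cells enter the test.
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact P value for a systematic difference between two observers."""
    n, k = b + c, min(b, c)
    p_one_sided = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * p_one_sided)

print(round(mcnemar_exact(15, 5), 2))   # 0.04, as quoted for Table 4B
```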

Ordinal scale

Here again, percentage agreement is commonly reported in the literature, but simple percentage agreement is best avoided since it does not take into account any chance-expected agreement. If the two midwives in Table 2 were asked to classify pregnant women into four ordered categories of blood pressure (i.e. normal blood pressure, mild hypertension, moderate hypertension, severe hypertension), then it is obvious that there are various levels of disagreement. The discrepancy between the normotensive and severe hypertensive categories is much worse than that between the normotensive and mild hypertensive categories. It is logical to allow some credit for partial agreement; simple percentage agreement fails to do this, so the observer agreement appears less favourable than it actually is. Again the kappa statistic comes to the rescue, but here the statistic of choice is the weighted version of kappa [18]. Weighted kappa corrects for chance agreement and also allows credit for partial agreement. Electronic statistical packages are available to calculate weighted kappa, its precision (95% confidence intervals) and its statistical significance [12]. The guidelines in Table 3 can be used to assess the agreement. As before, the quantitative significance of weighted kappa is far more important clinically than its statistical significance [15].
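Weighted kappa can also be computed by hand when no package is to hand. The sketch below is ours and uses linear disagreement weights, one common choice; the four-category counts are invented purely for illustration and are not study data.

```python
# Minimal sketch (ours): linearly weighted kappa for two observers using the same
# set of ordered categories. Weight 0 = exact agreement, 1 = maximal disagreement.

def weighted_kappa(table):
    """table[i][j] = subjects placed in category i by observer 1 and j by observer 2."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(table[i]) for i in range(k)]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    obs = exp = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)                    # linear disagreement weight
            obs += w * table[i][j] / n                  # observed weighted disagreement
            exp += w * row_tot[i] * col_tot[j] / n**2   # chance-expected weighted disagreement
    return 1 - obs / exp

# Hypothetical counts for four ordered blood-pressure categories (normal .. severe)
hypothetical = [[60,  8,  2, 0],
                [10, 25,  6, 1],
                [ 2,  7, 15, 4],
                [ 0,  1,  3, 6]]
print(round(weighted_kappa(hypothetical), 2))
```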

Dimensional scale

Pearson's correlation coefficient between the measurements obtained by the two observers has been popular for the assessment of the reliability of clinical tests on a continuous scale [5,14]. However, Pearson's correlation coefficient measures the association between two sets of measurements, not their agreement [8,19]. Fig. 1 represents two sets of measurements obtained by two observers, A and B. Line 1 shows perfect association, the correlation coefficient being 1.0, and also perfect agreement, the measurements obtained by Observer A being the same as those obtained by Observer B. Line 2 shows perfect association, the correlation coefficient again being 1.0, but no agreement, since the measurements obtained by Observer B are always two points greater than those obtained by Observer A.

[Fig. 1. Plot of measurements obtained by Observer A against those obtained by Observer B. Line 1 represents perfect correlation and agreement; Line 2 represents perfect correlation but not agreement.]

To measure agreement, Bland and Altman recommend the method of limits of agreement [19]. This involves a scatter plot of the difference between the measurements obtained by the two observers against the mean of the measurements, for each subject in the study. The 95% limits of agreement define the 95% data interval of the differences between the measurements obtained by the two observers, expressed as a range which will encompass 95% of the differences. An example of limits of agreement is the comparison between two- and three-dimensional measurements of the volume of a balloon using ultrasound [20]. These measurements were performed independently by two observers on 30 balloons of different sizes. Fig. 2 shows the 95% limits of agreement for measurement of the balloon volume by two-dimensional ultrasound to be -26.2 mL to +35.8 mL. Fig. 3 shows the 95% limits of agreement for measurement of the balloon volume by three-dimensional ultrasound to be -24.5 mL to +18.9 mL. The range of the 95% limits of agreement for three-dimensional ultrasound (43.4 mL) is less than that for two-dimensional ultrasound (62.0 mL), suggesting that three-dimensional ultrasound may be the more reliable test.

[Fig. 2. Limits of agreement with two-dimensional ultrasound.]

[Fig. 3. Limits of agreement with three-dimensional ultrasound.]

The interpretation of whether or not there is acceptable agreement between the two observers depends on subjective comparison of the limits of agreement with the range of the measurements normally encountered in clinical practice. As long as the range of the limits of agreement is considered not to be clinically important, the agreement is acceptable [19]. One disadvantage of the method of limits of agreement is that it formally measures the variation between the observers, but takes no formal account of the variation between the subjects in the study.

Another method of measuring agreement for continuous data is the intra-class correlation coefficient, which formally measures both the variation between the observers and the variation between the subjects in the study by analysis of variance [8,21]. Mathematically, the intra-class correlation coefficient is the proportion of the total variance which is due to the variation between the subjects. An intra-class correlation coefficient of 1 indicates that the total variance is due solely to the variation between the subjects, there being no contribution from variation between the observers; an intra-class correlation coefficient of 0 indicates that none of the total variance is due to variation between subjects, all of it being attributable to variation between observers. Therefore, like the kappa statistic, the intra-class correlation coefficient ranges from 0 to 1, where 0 shows no agreement and 1 shows perfect agreement [22]. An intra-class correlation coefficient greater than 0.75 is considered to indicate good agreement [15]. An approximate 95% confidence interval for the intra-class correlation coefficient can also be estimated [9].

There is considerable debate about the advantages and disadvantages of limits of agreement and the intra-class correlation coefficient [8]. Until there is some consensus, we would encourage the use of both the limits of agreement and the intra-class correlation coefficient as measures of the reliability of tests using continuous data, and discourage the use of Pearson's correlation coefficient.
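To make the two approaches concrete, the sketch below (ours; the paired measurements are invented and are not the balloon data of reference 20) computes the 95% limits of agreement and a one-way random-effects intra-class correlation coefficient for two observers measuring the same subjects. Other ICC forms exist for more complex designs.

```python
# Minimal sketch (ours): 95% limits of agreement (Bland & Altman) and a one-way
# intraclass correlation coefficient for two observers measuring the same subjects.
# The data below are invented for illustration only.
from statistics import mean, stdev

obs_a = [52.0, 61.5, 48.2, 70.1, 55.3, 66.8, 59.4, 63.0]
obs_b = [55.1, 60.2, 50.0, 73.4, 54.0, 69.9, 61.2, 64.5]

# Bland-Altman 95% limits of agreement: mean difference +/- 1.96 SD of the differences
diffs = [a - b for a, b in zip(obs_a, obs_b)]
d_bar, sd = mean(diffs), stdev(diffs)
print("95%% limits of agreement: %.1f to %.1f" % (d_bar - 1.96 * sd, d_bar + 1.96 * sd))

# One-way random-effects ICC: share of total variance due to between-subject variation
k, n = 2, len(obs_a)
grand = mean(obs_a + obs_b)
subject_means = [(a + b) / 2 for a, b in zip(obs_a, obs_b)]
msb = k * sum((m - grand) ** 2 for m in subject_means) / (n - 1)   # between-subject MS
msw = sum((a - m) ** 2 + (b - m) ** 2
          for a, b, m in zip(obs_a, obs_b, subject_means)) / (n * (k - 1))  # within MS
icc = (msb - msw) / (msb + (k - 1) * msw)
print("ICC = %.2f" % icc)
```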
Khalid S. Khan, Patrick F.W. Chien*
Departments of Obstetrics and Gynaecology, Birmingham Women's Hospital, Birmingham, UK, and Ninewells Hospital, Dundee, UK

References

1. Koran LM. The reliability of clinical methods, data and judgements (two parts). N Engl J Med 1975;293:642-646, 695-701.
2. Department of Clinical Epidemiology and Biostatistics. Clinical disagreement: I. How often it occurs and why. Can Med Assoc J 1980;123:499-504.
3. Khan KS, Khan SF, Nwosu CR, Chien PFW. Misleading authors' inferences in obstetric diagnostic test literature. Am J Obstet Gynecol 1999;181:112-115.
4. Nwosu CR, Khan KS, Chien PFW, Honest MR. Is real-time ultrasonic bladder volume estimation reliable and valid? A systematic overview. Scand J Urol Nephrol 1998;32:325-330.
5. Khan KS, Chien PFW, Honest MR. Evaluating the measurement variability of clinical investigations: the case of ultrasonic estimation of urinary bladder volume. Br J Obstet Gynaecol 1997;104:1036-1042.
6. Chien PFW, Khan KS, Ogston S, Owen P. The diagnostic accuracy of cervico-vaginal fetal fibronectin in predicting preterm delivery: an overview. Br J Obstet Gynaecol 1997;104:436-444.
7. Chien PFW, Arnott N, Gordon A, Owen P, Khan KS. How useful is uterine artery Doppler flow velocimetry in the prediction of pre-eclampsia, intrauterine growth retardation and perinatal death? An overview. Br J Obstet Gynaecol 2000;107:196-208.
8. Streiner DL, Norman GR. Health Measurement Scales: A Practical Guide to Their Development and Use. New York: Oxford University Press, 1995.
9. Dunn G, Everitt B. Clinical Biostatistics: An Introduction to Evidence-Based Medicine. London: Edward Arnold, 1995.
10. Healy MJ. Measuring measuring errors. Stat Med 1989;8:893-906.
11. Fleiss JL. Measuring agreement between two judges on the presence or absence of a trait. Biometrics 1975;31:651-659.
12. Buchan I. Arcus QuickStat (Biomedical) Version 1.2. Cambridge: Addison Wesley Longman, 1998.
13. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159-174.
14. Brennan P, Silman A. Statistical methods for assessing observer variability in clinical measures. BMJ 1992;304:1491-1494.
15. Kramer MS, Feinstein AR. The biostatistics of concordance. Clin Pharmacol Ther 1981;29:111-123.
16. Thompson WD, Walter SD. A reappraisal of the kappa coefficient. J Clin Epidemiol 1988;41:949-958.
17. Feinstein AR, Cicchetti DV. High agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol 1990;43:543-549.
18. Cohen J. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol Bull 1968;70:213-220.
19. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;i:307-310.
20. Farrell T, Leslie JR, Chien PFW, Agustsson P. The reliability and validity of three dimensional ultrasound volumetric measurements using an in vitro balloon and in vivo uterine model. Br J Obstet Gynaecol 2001;108:??.
21. Bartko JJ. The intraclass correlation coefficient as a measure of reliability. Psychol Rep 1966;19:3-11.
22. Fleiss JL, Cohen J. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educ Psychol Meas 1973;33:613-619.