Evaluation of a clinical test. I: Assessment of reliability


British Journal of Obstetrics and Gynaecology, June 2001, Vol. 108, pp. 562–567

COMMENTARY

Introduction

Testing and screening are critical parts of the clinical process, since inappropriate testing strategies put patients at risk and entail a serious waste of resources [1,2]. Based on our recent experiences of evaluating diagnostic literature [3–7], we have come to believe that there is much misunderstanding about the evaluation of clinical tests. Some tests, introduced into practice without proper evaluation, are so inefficient as to be almost useless. In our view, the absence of clear methodological guidelines about the evaluation of clinical tests is a major impediment. Just as robust research methods in assessing the effectiveness of treatments have been actively pursued over the last decade, so attention needs to be focused on how research on diagnostic tests and their impact on clinical practice might be improved. Our commentary is prompted by the concern that there is a huge disparity between the number of clinical tests and the availability of robust research evidence to help make decisions about their most appropriate clinical application.

We must first ask why inefficiency in clinical testing leads to mismanagement of patients. The answer is quite simple. When a diagnosis is missed, early therapy cannot be undertaken, and morbidity is prolonged. On the other hand, when a diagnosis is made in the absence of disease, unnecessary therapy may be undertaken, with the risk of adverse effects.

But how does inefficiency in clinical testing arise in the first place? We need to understand that the results of our tests are the outcomes of clinical measurements. It is the errors in these clinical measurements that lead to inefficiency in clinical testing. Errors in clinical measurements [8–10] are of two sorts.
Firstly, measurement may be inconsistent if the same attribute recorded by another observer (or recorded a second time by the same observer) leads to a different reading. The term reliability refers to this type of measurement error. Secondly, the measurement obtained may not be accurate when compared with the 'true' state of the attribute estimated by a suitable reference standard. This type of measurement error is referred to as validity. The goal of research is to determine whether a clinical test measures what is intended (validity), but first it should be established that it measures something in a consistent fashion (reliability).

Based on these two types of error in clinical measurement, our commentary is divided into two parts. In the first part, the focus is on appropriate strategies for conducting and analysing studies of the reliability of a clinical test. In the second part, strategies for conducting and analysing studies of the validity of a clinical test will be described.

© RCOG 2001 British Journal of Obstetrics and Gynaecology

Design of a study of reliability

Reliability studies are generally reported in the literature as observer variability studies. The study is designed to compare measurements obtained by two or more observers (inter-rater reliability) or by one observer on two or more different occasions (intra-rater reliability). Intra-rater reliability is a prerequisite for inter-rater reliability [8]. We will restrict our description to inter-rater reliability. The objective of the study is to measure the same clinical attribute independently on at least two occasions and then to assess the agreement between these measurements.

In order for the reliability of a test to be replicated from a published study, researchers should provide sufficient information about the manner in which the test was conducted [9].
The information should cover all important issues with regard to the conduct of the test, such as the preparation of the patients, measurements of biophysical recordings, details of laboratory assays, and computation of results.

In studies of reliability, one possible source of bias is the use of measurements from a sample which is not representative of the population being studied [9]. Reliability will appear to be better than it really is if the researchers have deliberately discarded difficult or borderline cases from the study. Such omissions are more likely to occur with convenience or arbitrary methods of sampling. Selection bias is less likely with consecutive or random sampling [6]. Sadly, sampling is inadequate in many studies [4,6,7].

Studies of reliability also require the observers to be blinded to one another's measurements. Blind recording of measurements avoids bias, since recordings made by one observer are not influenced by knowledge of the measurements obtained by other observers; in practice, however, blinding is often lacking in studies of reliability. In our systematic review of the reliability of bladder volume measurement by ultrasound, we found that blinding of the ultrasonic bladder volume measurements was adequate in only 40% of the studies [5]. Unless the sample is representative of the population being studied and unless the recordings of the observers are not made available to one another, we can

have no confidence in a study of the reliability of a clinical test.

The precise estimation of the reliability of a test requires an adequate sample size. The calculation of the sample size for studies of reliability can be quite complex. Although methods for the estimation of the appropriate sample size for studies of reliability are available, such calculations are seldom performed.

Data analysis of a study of reliability

Table 1 shows the different types of measurement encountered in clinical practice, with some examples: nominal (dichotomous); ordinal (ranked); and dimensional (continuous). The important point is that in studies of the reliability of a clinical test, the measurements recorded by the two observers should be expressed on the same type of scale, and with the same number of categories if the data are ordinal. It is important to remember that the purpose of a study of reliability is to determine the agreement (or concordance) of the measurements obtained by two observers, measuring the same clinical attribute independently [8].

Table 1. Some types of measurement scales and examples of different reliability studies.

Scale | Description | Example
Nominal | Non-ranked categories | Presence or absence of hypertension (based on a particular cut-off level of blood pressure)
Ordinal | Three or more ranked categories | None, mild, moderate, or severe hypertension (based on categories of a range of normal, low high, medium high and very high values of blood pressure)
Dimensional | Continuous or decimal scale | Exact blood pressure values expressed in mmHg

Nominal scale

When dealing with dichotomous data (for example, the presence or absence of hypertension), many researchers will report the percentage agreement as the index of reliability.
From the hypothetical example in Table 2, the percentage agreement between the two midwives recording whether pregnant women are hypertensive or normotensive is 91.3%, a statistic that looks impressive because of its closeness to 100% (the value depicting perfect agreement). However, this statistic does not take into account the agreement that was expected to occur by chance alone. In the bottom rows of Table 2, we have calculated the chance-expected percentage agreement. It is fairly close to the observed percentage agreement, so one may conclude that the agreement beyond chance is not very great.

Table 2. Agreement (disagreement) between two midwives recording mid-trimester diastolic blood pressure in a high-risk antenatal clinic and classifying it as normal or hypertension based on a cut-off level of 90 mmHg.

 | Midwife A: Hypertension | Midwife A: Normal | Total
Midwife B: Hypertension | 10 (a) | 10 (b) | 20 (R1)
Midwife B: Normal | 10 (c) | 200 (d) | 210 (R2)
Total | 20 (C1) | 210 (C2) | 230 (N)

Prevalence of hypertension = (R1 or C1)/N = 20/230 = 8.7%
Observed percentage agreement = [(a + d)/N] × 100 = [(10 + 200)/230] × 100 = 91.3%
Chance-expected percentage agreement = {[(R1 × C1)/N] + [(R2 × C2)/N]} × 100/N = [(R1 × C1) + (R2 × C2)] × 100/N² = [(20 × 20) + (210 × 210)] × 100/230² = 84.1%
Kappa coefficient (κ) = (observed − chance-expected percentage agreement)/(perfect − chance-expected percentage agreement) = (91.3 − 84.1)/(100 − 84.1) = 0.45 (95% CI 0.32–0.58)

The statistic of choice for estimating the agreement between observers using the same nominal or dichotomous scale of measurement is kappa [11], which corrects for the agreement expected by chance. Kappa is the observed agreement minus the agreement expected by chance, divided by perfect agreement minus the agreement

expected by chance [9]:

$$\kappa = \frac{P_o - P_e}{1 - P_e}$$

where $P_o$ is the observed agreement and $P_e$ is the agreement expected by chance. Kappa therefore gives more information than simple percentage agreement. Its values normally range from 0 to 1, with 0 representing no agreement beyond chance and 1 representing perfect agreement (negative values, indicating agreement worse than chance, are possible but uncommon). The standard error of kappa allows us to estimate its statistical significance and also its 95% confidence interval. These computations can be performed using a computer [12].

The magnitude of kappa is a far more important measure of agreement than its statistical significance. The guidelines [13] for interpretation of the values of kappa are given in Table 3. Using the example of the agreement between the blood pressure recordings of two midwives in Table 2, the kappa value obtained is 0.45 (95% CI 0.32–0.58), indicating moderate agreement. We should note that the interpretation of kappa is subjective and the values of kappa in Table 3 are considered to be optimistic by some investigators [14,15].

Table 3. Guidelines for interpretation of kappa statistic [13].

Kappa value | Strength of agreement
<0 | Poor
0–0.20 | Slight
0.21–0.40 | Fair
0.41–0.60 | Moderate
0.61–0.80 | Substantial
0.81–1.0 | Excellent

The value of kappa depends on the prevalence of the disorder being studied. Suppose the study in Table 2 was repeated in a different population, where the prevalence of hypertension was 40% (Table 4). The observer agreement was found to be lower (kappa = 0.17) (Table 4a). This is because a high prevalence of hypertension results in a high level of chance-expected agreement and hence a lower kappa value; conversely a condition with a low prevalence will tend to give higher values of kappa [16]. Therefore, kappa values generated from studies on disparate populations are not easily comparable [14].
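The arithmetic behind Table 2 and the kappa formula can be reproduced in a few lines. The following is a minimal Python sketch, not part of the original commentary: the function names `cohen_kappa_2x2` and `strength_of_agreement` are hypothetical, and treating kappa values below 0 as "Poor" is an assumption about the lowest band of Table 3.

```python
from typing import Tuple


def cohen_kappa_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float, float]:
    """Return (observed agreement, chance-expected agreement, kappa).

    The cells follow the layout of Table 2: a and d are the counts on which
    the two observers agree, b and c the counts on which they disagree.
    Agreements are returned as proportions, not percentages.
    """
    n = a + b + c + d
    if n == 0:
        raise ValueError("the table is empty")
    r1, r2 = a + b, c + d  # row totals (R1, R2)
    c1, c2 = a + c, b + d  # column totals (C1, C2)
    p_o = (a + d) / n
    p_e = (r1 * c1 + r2 * c2) / n ** 2
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance-expected agreement is 1")
    kappa = (p_o - p_e) / (1.0 - p_e)
    return p_o, p_e, kappa


def strength_of_agreement(kappa: float) -> str:
    """Map a kappa value to the labels of Table 3 (assumed band edges)."""
    if kappa < 0:
        return "Poor"
    if kappa <= 0.20:
        return "Slight"
    if kappa <= 0.40:
        return "Fair"
    if kappa <= 0.60:
        return "Moderate"
    if kappa <= 0.80:
        return "Substantial"
    return "Excellent"


if __name__ == "__main__":
    # Counts taken from Table 2 (two midwives, 230 women).
    p_o, p_e, kappa = cohen_kappa_2x2(a=10, b=10, c=10, d=200)
    print(f"observed agreement: {100 * p_o:.1f}%")
    print(f"chance-expected agreement: {100 * p_e:.1f}%")
    print(f"kappa: {kappa:.2f} ({strength_of_agreement(kappa)})")
```

For the counts in Table 2 this gives an observed agreement of 91.3%, a chance-expected agreement of 84.1% and a kappa of about 0.45, matching the values in the table. The sketch does not compute the standard error or confidence interval of kappa; a statistical package would normally be used for those.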
When there is a systematic difference between the two midwives in recording the presence or absence of hypertension, higher than expected values of the kappa statistic can also be obtained [14]. In the above two examples, the prevalences of hypertension diagnosed by both midwives were identical (9% and 40% respectively). Let us assume that the prevalence of hypertension diagnosed by Midwife A was 7% and the prevalence of hypertension diagnosed by Midwife B was 11% (Table 4b). The kappa value is 0.46, which is almost identical to that obtained in the first study and better than that obtained in the second study, despite the systematic disagreement in the diagnosis of hypertension between the two midwives. The examples in Table 4 illustrate the paradoxes of kappa [17]: in Table 4a, the agreement is moderate, despite the lower value of kappa; and in Table 4b the agreement is poor, despite the higher value of kappa.

Table 4. Effect of different sample recruitment strategies and systematic difference in diagnosis of hypertension on the kappa (κ) statistic for antenatal blood pressure measurement by two midwives. Each study is a cross-tabulation of Midwife B (rows: hypertension, normal, total) against Midwife A (columns: hypertension, normal, total).

A. Study 2
Prevalence of hypertension = 40.0%
Kappa coefficient (κ) = 0.17 (95% CI )

B. Study 3
Prevalence of hypertension (for Midwife A) = 6.5% (under-diagnosis)
Prevalence of hypertension (for Midwife B) = 10.9% (over-diagnosis)
Kappa coefficient (κ) = 0.46 (95% CI )

McNemar's test will estimate the probability that the difference in the number of disagreements between the two observers could have occurred by chance [14]. For the information in Table 4b, P = 0.04,

suggesting that there is systematic disagreement between the two observers. McNemar's test is available in electronic statistical packages [12].

Ordinal scale

Here again percentage agreement is commonly reported in the literature, but simple percentage agreement is best avoided since it does not take into account any chance-expected agreement. If the two midwives in Table 2 were asked to classify pregnant women into four ordered categories of blood pressure (i.e. normal blood pressure, mild hypertension, moderate hypertension, severe hypertension), then it is obvious that there are various levels of disagreement. The discrepancy between normotensive and severe hypertensive categories is much worse than that between normotensive and mild hypertensive categories. It is logical to allow some credit for partial agreement; simple percentage agreement fails to do this, so the observer agreement appears less favourable than it actually is. Again the kappa statistic comes to the rescue, but here the statistic of choice is the weighted version of kappa [18]. Weighted kappa corrects for chance agreement and it also allows credit for partial agreement. Again, electronic statistical packages are available to calculate weighted kappa, its precision (95% confidence intervals) and its statistical significance [12]. The guidelines in Table 3 can be used to assess the agreement. As before, the magnitude of weighted kappa is far more important clinically than its statistical significance [15].

Dimensional scale

Pearson's correlation coefficient of the measurements obtained by the two observers has been popular for the assessment of the reliability of clinical tests on a continuous scale [5,14]. However, Pearson's correlation coefficient measures the association between two sets of measurements, but not their agreement [8,19]. Fig. 1 represents two sets of measurements obtained by two observers, A and B.
Line 1 shows perfect association, the correlation coefficient being 1.0, and also perfect agreement, the measurements obtained by Observer A being the same as the measurements obtained by Observer B. Line 2 shows perfect association, the correlation coefficient being 1.0, but no agreement, since the measurements obtained by Observer B are always two points greater than the measurements obtained by Observer A.

Fig. 1. Plot of measurements obtained by Observer A versus those obtained by Observer B. Line 1 represents perfect correlation and agreement. Line 2 represents perfect correlation but not agreement.

To measure agreement, Bland and Altman recommend the method of limits of agreement [19]. This involves a scatter plot of the difference between the measurements obtained by the two observers against the mean of the measurements, for each subject in the study. The 95% limits of agreement are the 95% data interval of the differences between the measurements obtained by the two observers, expressed as a range which will encompass 95% of the differences.

Fig. 2. Limits of agreement with two-dimensional ultrasound.

An example of limits of agreement is the comparison between two- and three-dimensional measurements of the volume of a balloon using ultrasound [20]. These measurements were performed, independently, by two observers on 30 balloons of different sizes. Fig. 2 shows the 95% limits of agreement of measurement of the balloon

volume by two-dimensional ultrasound to be mL to mL. Fig. 3 shows the 95% limits of agreement of measurement of the balloon volume by three-dimensional ultrasound to be −24.5 mL to +18.9 mL. The range of the 95% limits of agreement of three-dimensional ultrasound (43.4 mL) is less than the range of the 95% limits of agreement of two-dimensional ultrasound (62.0 mL), suggesting that three-dimensional ultrasound may be the more reliable test. The interpretation of whether or not there is acceptable agreement between the two observers depends on subjective comparison of the limits of agreement with the range of the measurements normally encountered in clinical practice. As long as the range of the limits of agreement is considered not to be clinically important, then the agreement is acceptable [19].

Fig. 3. Limits of agreement with three-dimensional ultrasound.

One disadvantage of the method of limits of agreement is that it measures formally the variation between the observers, but takes no formal account of the variation between the subjects in the study. Another method of measuring agreement for continuous data is the intra-class correlation coefficient, which measures formally both the variation between the observers and the variation between the subjects in the study by analysis of variance [8,21]. Mathematically, the intra-class correlation coefficient is the proportion of the total variance which is due to the variation between the subjects. An intra-class correlation coefficient of 1 indicates that the total variance is due solely to the variation between the subjects, there being no contribution to the total variance from variation between the observers; while an intra-class correlation coefficient of 0 indicates that none of the total variance is due to variation between subjects, all of the total variance being attributed to variation between observers.
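Both measures described above can be illustrated on a small data set. The following is a minimal Python sketch under stated assumptions: the paired volumes are hypothetical and are not taken from the balloon study, the function names are invented for illustration, and the intra-class correlation coefficient is computed in its one-way random-effects form for two observers, which is only one of several analysis-of-variance formulations and may not be the one the cited references use.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def limits_of_agreement(obs_a: Sequence[float],
                        obs_b: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean difference, lower limit, upper limit).

    The 95% limits are the mean of the paired differences (A minus B)
    plus or minus 1.96 times their sample standard deviation.
    """
    if len(obs_a) != len(obs_b) or len(obs_a) < 2:
        raise ValueError("need at least two paired measurements")
    diffs = [a - b for a, b in zip(obs_a, obs_b)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)
    return bias, bias - half_width, bias + half_width


def icc_one_way(obs_a: Sequence[float], obs_b: Sequence[float]) -> float:
    """One-way random-effects intra-class correlation for two observers."""
    if len(obs_a) != len(obs_b) or len(obs_a) < 2:
        raise ValueError("need at least two paired measurements")
    pairs = list(zip(obs_a, obs_b))
    n, k = len(pairs), 2  # n subjects, k observers per subject
    subject_means = [mean(pair) for pair in pairs]
    grand_mean = mean(subject_means)
    # Between-subject and within-subject mean squares from one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum((x - m) ** 2
                    for pair, m in zip(pairs, subject_means)
                    for x in pair) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


if __name__ == "__main__":
    # Hypothetical volumes (mL) recorded by two observers on five subjects.
    observer_a: List[float] = [100, 150, 200, 250, 300]
    observer_b: List[float] = [104, 148, 207, 249, 302]
    bias, lower, upper = limits_of_agreement(observer_a, observer_b)
    print(f"mean difference: {bias:.1f} mL")
    print(f"95% limits of agreement: {lower:.1f} mL to {upper:.1f} mL")
    print(f"intra-class correlation: {icc_one_way(observer_a, observer_b):.3f}")
```

In this hypothetical data set the subjects differ from one another far more than the observers differ on any one subject, so the intra-class correlation coefficient is close to 1, while the limits of agreement still report the size of the disagreement in the original units for clinical judgement.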
Therefore, like the kappa statistic, the intra-class correlation coefficient ranges from 0 to 1, where 0 shows no agreement and 1 shows perfect agreement [22]. An intra-class correlation coefficient greater than 0.75 is considered to be good agreement [15]. An approximate 95% confidence interval for the intra-class correlation coefficient can also be estimated [9].

There is considerable debate about the advantages and disadvantages of limits of agreement and the intra-class correlation coefficient [8]. Until there is some consensus, we would encourage the use of both the limits of agreement and the intra-class correlation coefficient as measures of reliability of tests using continuous data, and discourage the use of Pearson's correlation coefficient.

Khalid S. Khan, Patrick F.W. Chien *
Departments of Obstetrics and Gynaecology, Birmingham Women's Hospital, UK and Departments of Obstetrics and Gynaecology, Ninewells Hospital, Dundee, UK

References

1. Koran LM. The reliability of clinical methods, data and judgements [two parts]. N Engl J Med 1975;293:642–
2. Department of Clinical Epidemiology and Biostatistics. Clinical disagreement: I. How often it occurs and why. Can Med Assoc J 1980;123:499–
3. Khan KS, Khan SF, Nwosu CR, Chien PFW. Misleading authors' inferences in obstetric diagnostic test literature. Am J Obstet Gynecol 1999;181:112–
4. Nwosu CR, Khan KS, Chien PFW, Honest MR. Is real-time ultrasonic bladder volume estimation reliable and valid? A systematic overview. Scand J Urol Nephrol 1998;32:325–
5. Khan KS, Chien PFW, Honest MR. Evaluating the measurement variability of clinical investigations: The case of ultrasonic estimation of urinary bladder volume. Br J Obstet Gynaecol 1997;104:1036–
6. Chien PFW, Khan KS, Ogston S, Owen P. The diagnostic accuracy of cervico-vaginal fetal fibronectin in predicting preterm delivery: an overview. Br J Obstet Gynaecol 1997;104:436–
7. Chien PFW, Arnott N, Gordon A, Owen P, Khan KS.
How useful is uterine artery Doppler flow velocimetry in the prediction of pre-eclampsia, intrauterine growth retardation and perinatal death? An overview. Br J Obstet Gynaecol 2000;107:196–
8. Streiner DL, Norman GR. Health Measurement Scales: A Practical Guide to Their Development and Use. New York: Oxford University Press.
9. Dunn G, Everitt B. Clinical Biostatistics. An Introduction to Evidence-Based Medicine. London: Edward Arnold.
10. Healy MJ. Measuring measuring errors. Stat Med 1989;8:893–
11. Fleiss JL. Measuring agreement between two judges on the presence or absence of a trait. Biometrics 1975;31:651–
12. Buchan I. Arcus QuickStat (Biomedical) Version 1.2. Cambridge: Addison Wesley Longman.
13. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–
14. Brennan P, Silman A. Statistical methods for assessing observer variability in clinical measures. BMJ 1992;304:1491–
15. Kramer MS, Feinstein AR. The biostatistics of concordance. Clin Pharmacol Ther 1981;29:111–123.

16. Thompson WG, Walter DW. A reappraisal of the kappa coefficient. J Clin Epidemiol 1988;41:949–
17. Feinstein AR, Cicchetti DV. High agreement but low kappa: I. The problem of two paradoxes. J Clin Epidemiol 1990;43:543–
18. Cohen J. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol Bull 1968;70:213–
19. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;i:307–
20. Farrell T, Leslie JR, Chien PFW, Agustsson P. The reliability and validity of three dimensional ultrasound volumetric measurements using an in vitro balloon and in vivo uterine model. Br J Obstet Gynaecol 2001;108:??.
21. Bartko JJ. The intraclass correlation coefficient as a measure of reliability. Psychol Rep 1966;19:3–
22. Fleiss JL, Cohen J. The equivalence of weighted kappa and intraclass correlation coefficient as measures of reliability. Educ Psychol Meas 1973;2:113–117.
