
Journal of Clinical Epidemiology 60 (2007) 530–534

BRIEF REPORT

Studies reporting ROC curves of diagnostic and prediction data can be incorporated into meta-analyses using corresponding odds ratios

S.D. Walter a,*, T. Sinuff b

a Department of Clinical Epidemiology and Biostatistics, McMaster University, 1200 Main Street West, HSC-2C16, Hamilton, Ontario, Canada L8N 3Z5
b Department of Critical Care Medicine, Sunnybrook Health Sciences Centre and Interdepartmental Division of Critical Care, University of Toronto, Toronto, Ontario, Canada

Accepted 16 September 2006

Abstract

Objective: To develop an approach by which studies describing the accuracy of diagnostic tests or clinical predictions can be combined in a meta-analysis, even though studies may report their results using different summary measures.

Study design: A method is proposed to allow algebraic and numerical conversion of values of the Receiver Operating Characteristic Area Under the Curve (AUC) summary statistic into corresponding odds ratios (OR). A similar conversion is demonstrated for the standard errors (SEs) of these summary statistics.

Results: The conversion of the AUC values into OR values was achieved using a logit-threshold model. The delta method was used to convert the associated SEs. An example concerning predictions of mortality in the intensive care unit illustrates the calculations.

Conclusion: This paper provides an accessible method that permits the meta-analyst to overcome some of the difficulties implied by incomplete and inconsistent reporting of research studies in this area. It allows all studies to be included on the same metric, which in turn more easily permits exploration of issues such as heterogeneity. The method can readily be used for meta-analyses of diagnostic or screening tests, or for prediction data. © 2007 Elsevier Inc. All rights reserved.

Keywords: Meta-analysis; Diagnostic test; Prediction; Area Under the Curve; Odds ratio; Biostatistics

1. Introduction

A difficulty encountered in carrying out a systematic review of a diagnostic test (or a prediction method) is that the component studies may report their results in different ways. Specifically, while some report diagnostic odds ratios (OR), others provide the Area Under the Curve (AUC) from a Receiver Operating Characteristic (ROC) curve. While both the OR and the AUC are valid summary measures of diagnostic accuracy, they are on different metrics, which makes the combination of studies into a meta-analysis problematic. This difficulty is analogous to attempting to combine studies of therapeutic effectiveness that report different types of effect measure, for example, a mean reduction in blood pressure vs. the percentage of patients whose blood pressure has been reduced below a target threshold value [1]. In this paper we illustrate a method to convert the studies reporting AUC values into corresponding ORs, which then facilitates their inclusion in a meta-analysis using standard methods for combining ORs [2]. The conversion of the AUC values into OR point estimates is achieved using a logit-threshold model. We also use the delta method, together with a result proposed by Hanley and McNeil [3] for the SE of the AUC, to convert the associated standard errors (SEs). Our conversion method can be used equally well for data originating from a predictive rule (as in our illustrative example) or from a diagnostic test.

* Corresponding author. Tel.: 905-525-9140 ext. 23387; fax: 905-577-0017. E-mail address: walter@mcmaster.ca (S.D. Walter).

doi:10.1016/j.jclinepi.2006.09.002

2. Example

We were motivated to investigate this topic during a systematic overview of the accuracy of physicians predicting future mortality of critically ill patients in intensive care units. Because of the subjective judgment required on the part of the physicians, we wished to compare their predictive accuracy with that of certain objective scoring systems. Full details of our systematic overview are given elsewhere [4], so we provide here only those details required for illustration of our method. From a total of 1,626 citations, 12 studies were identified that met our inclusion criteria. A funnel plot revealed no evidence of publication bias. There were seven studies where we could construct separate 2 × 2 tables of the mortality predicted by physicians and scoring systems vs. actual mortality. A prediction threshold of 50% was used to dichotomize data from studies where predictions were made using interval or continuous data. From each 2 × 2 table, a diagnostic OR was computed as the summary measure of predictive accuracy. The OR here is the ratio of the odds of prediction of death in a patient who actually dies, compared to the odds for a patient who actually survives. Thus, the OR is obtained from (TP × TN)/(FP × FN), where TP, TN, FP, and FN are the counts of true positive, true negative, false positive, and false negative predictions, respectively. Five studies did not give ORs, but instead used ROC methodology to report their findings. An ROC curve plots the TP prediction rate vs. the FP prediction rate as the prediction threshold varies between 0 and 1. These studies summarized their ROC curves using the AUC measure. A perfect prediction would yield AUC = 1, whereas AUC = ½ would suggest predictive accuracy equal to that of chance alone. Two of the five studies reported a SE for the AUC, one gave confidence intervals (CI) for the AUC, and the remaining two studies gave no indication of the precision of their AUC estimates.
All five studies reported sample sizes, including the total number of patients and the number of deaths occurring. The five ROC studies had a total sample size of 3,215, while the seven studies reporting ORs had a total sample size of 2,706. Therefore, the ROC studies contained over half (54%) of the available data, and it seemed essential to be able to include them in the meta-analysis. In order to incorporate them in a standard analysis to estimate the overall OR, we needed to convert the AUC results into the OR metric.

3. Method

We adopted the logit-threshold meta-analysis model proposed by Moses et al. [5,6], which relates the diagnostic (or predictive) OR to the test threshold. The model is defined as:

D = a + bS

where D = ln(TPR/[1 − TPR]) − ln(FPR/[1 − FPR]) is the diagnostic log odds ratio, ln(OR), and S = ln(TPR/[1 − TPR]) + ln(FPR/[1 − FPR]) is a function of the test threshold. TPR and FPR are the true positive and false positive rates of the test, respectively. The regression slope b indicates the dependence of the test accuracy on the threshold. The intercept a can be taken as an overall estimate of ln(OR). If the component studies in a meta-analysis are assumed to be approximately homogeneous with respect to their ORs, then we have shown elsewhere [7] that the AUC under the summary ROC curve can be expressed in terms of the corresponding OR through the following equation:

AUC = OR[(OR − 1) − ln(OR)]/(OR − 1)²    (1)

This expression also gives a good approximation to the AUC even if the component studies are heterogeneous [7]. In the context of ROC curves for single studies, the homogeneous logit-threshold model is equivalent to assuming two logistic distributions for an underlying continuum of the test results, where the distributions for true cases and noncases of disease differ only by a location shift. In the heterogeneous situation, the two distributions also differ by a scale factor (implying different variances).
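Equation (1) is strictly increasing in OR (for OR > 1), so the iterative solution of equation (1) for OR can be implemented with simple bisection. The sketch below (plain Python, standard library only; the function names are ours, not the authors') recovers OR ≈ 8 from an AUC of 0.803, the value reported by Christensen et al.:

```python
import math

def auc_from_or(odds_ratio):
    """AUC of the symmetric SROC curve, equation (1): AUC = OR[(OR-1) - ln(OR)]/(OR-1)^2."""
    if abs(odds_ratio - 1.0) < 1e-9:
        return 0.5  # limiting value as OR -> 1 (chance-level accuracy)
    return odds_ratio * ((odds_ratio - 1.0) - math.log(odds_ratio)) / (odds_ratio - 1.0) ** 2

def or_from_auc(auc, lo=1.0 + 1e-9, hi=1e9):
    """Invert equation (1) by bisection; valid for 0.5 < auc < 1."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if auc_from_or(mid) < auc:
            lo = mid  # AUC too small: OR must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Christensen et al.: AUC = 0.803 should correspond to OR close to 8.0
print(round(auc_from_or(8.0), 4))   # 0.8034
print(round(or_from_auc(0.803), 2))
```

Any bracketing root-finder would do equally well here; bisection is used only because it needs no derivative and cannot overshoot the OR > 1 domain.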
The logistic distribution also gives a reasonably close approximation to normally distributed test results. When one of our studies reported an AUC summary measure, we used equation (1) iteratively to solve for the corresponding value of OR, thus achieving the required conversion.

3.1. Standard errors

We also converted SEs for the AUC estimates into the OR metric. First, we applied the delta method to equation (1) to obtain an approximate SE for AUC in terms of the SE of OR, as follows:

SE(AUC) = {[(OR + 1)ln(OR) − 2(OR − 1)]/(OR − 1)³} × SE(OR)

[cf. Ref. [7], equation 13]. This expression may be rearranged as

SE(OR) = SE(AUC)(OR − 1)³/[(OR + 1)ln(OR) − 2(OR − 1)]

in order to give SE(OR) in terms of SE(AUC), the direction of conversion that we require here. We then used a method by Hanley and McNeil [3], through an approximation described by Zhou et al. [8], to obtain a SE for AUC from

Var(AUC) = Q1/n + Q2/m − AUC²(m + n)/(mn)    (2)

where Q1 = AUC/(2 − AUC), Q2 = 2AUC²/(1 + AUC), m = number of deaths, n = number of survivors, and SE(AUC) = [Var(AUC)]^(1/2). Finally, having obtained estimates of OR and their SEs, the five studies using ROC methods could then be combined with the other seven studies that had reported in terms of OR originally.

4. Results

Table 1 shows the AUC results for the five studies [9–13] that reported findings using ROC methods. Also shown are the corresponding ORs and their SEs, obtained using the approach outlined above. For the meta-analysis itself, all studies were expressed in terms of ln(OR) and its corresponding SE, because of the greater stability and symmetry of the estimates after the logarithmic transformation [14]. Each study shows results for mortality predictions by physicians and a scoring algorithm, or in the case of one study [13], two alternative scoring methods. Three studies reported the AUC values to two decimal places (d.p.), while two studies gave them to three d.p. The limited accuracy of the former group may create numerical problems in the meta-analysis, because one then has only limited implied precision on the corresponding estimates of OR. For example, the study by Poses et al. [10] gives the AUC for physicians' predictions as 0.82 (i.e., to two d.p.). From equation (1) we find that this value is compatible (to an accuracy of one d.p.) with OR values from 9.0 to 9.8, which is a relatively wide range. The associated values of SE(OR) ranged from 3.77 to 4.18. In contrast, the study by Christensen et al. [9] gives the physicians' AUC = 0.803, that is, the summary measure is reported to three d.p. The value of 0.803 is compatible with OR = 7.94 to 8.01, and the associated values of SE(OR) range from 2.36 to 2.39. Thus, the narrower ranges for OR and its SE from studies that give their AUC value to more decimals are clearly evident.
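The sensitivity to reporting precision can be checked directly: an AUC printed as 0.82 is consistent with any underlying value in roughly [0.815, 0.825), so inverting equation (1) at the two endpoints brackets the compatible ORs. A sketch (helper names are ours; bisection stands in for the unspecified iterative solver, and small differences from the 9.0 to 9.8 range quoted in the text reflect rounding conventions):

```python
import math

def auc_from_or(odds_ratio):
    # Equation (1): AUC = OR[(OR - 1) - ln(OR)] / (OR - 1)^2
    return odds_ratio * ((odds_ratio - 1.0) - math.log(odds_ratio)) / (odds_ratio - 1.0) ** 2

def or_from_auc(auc, lo=1.0 + 1e-9, hi=1e9):
    # Bisection works because auc_from_or is strictly increasing for OR > 1
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if auc_from_or(mid) < auc else (lo, mid)
    return 0.5 * (lo + hi)

# Poses et al. report AUC = 0.82 (two decimal places): bracket the compatible ORs
or_low, or_high = or_from_auc(0.815), or_from_auc(0.825)
print(f"ORs compatible with a reported AUC of 0.82: {or_low:.2f} to {or_high:.2f}")
```

The width of this bracket shrinks by roughly a factor of ten for each extra reported decimal, which is the numerical point being made about the three-d.p. studies.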
In fact, in meta-analyses that involve only a small number of studies, limited accuracy in reported AUC values might necessitate a sensitivity analysis to evaluate the impact of assuming various OR values within the range that is compatible with a study's AUC. Two of the ROC studies [10,12] in Table 1 reported empirical SEs for their AUC values, for each of the clinician and scoring system predictions. Numerical comparisons revealed that these were close to the SEs obtained from our delta method calculation. For the study that reported a CI for the AUC [12], we inferred an approximate empirical value of SE(AUC) from the width of the interval. For example, the width of the 95% CI for the AUC from physicians' predictions was 0.10, implying an approximate SE(AUC) of 0.10/(2 × 1.96) = 0.0255; the value calculated from the delta method was slightly higher, at 0.0299. The corresponding empirical and delta method SEs from the two scoring algorithms in this study were 0.0281 vs. 0.0348 (scoring method 1) and 0.0306 vs. 0.0359 (scoring method 2). The formulation by Zhou et al. [8] of SE(AUC) approximates the Hanley and McNeil method by ignoring second-order and higher terms. We checked the accuracy of the Zhou et al. [8] approximation in two of the example papers. For the physicians' prediction in the Christensen et al. paper [9], the discrepancy in SE(AUC) was less than 0.2%. As seen in Table 1, the Christensen et al. paper [9] has a fairly typical sample size. We also checked the accuracy of the Zhou et al. approximation using the physicians' predictions from the Poses et al. study [10], which has the smallest sample size. Here the discrepancy of the approximation was approximately 4.5%. We regard this accuracy as quite satisfactory in practice. We constructed separate 2 × 2 tables of predicted mortality by physicians and scoring systems vs. actual mortality for the seven other studies that reported the original data [15–21].
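The SE pipeline can be reproduced end to end. The sketch below (function names are ours) applies the Zhou et al. approximation, equation (2), to the Christensen et al. sample sizes (m = 78 deaths, n = 151 survivors) and then pushes the result through the rearranged delta-method formula; the answer lands close to the SE(OR) of 2.37 shown in Table 1:

```python
import math

def var_auc_zhou(auc, m, n):
    # Equation (2): Var(AUC) = Q1/n + Q2/m - AUC^2 (m + n)/(m n)
    q1 = auc / (2.0 - auc)            # Hanley-McNeil Q1
    q2 = 2.0 * auc**2 / (1.0 + auc)   # Hanley-McNeil Q2
    return q1 / n + q2 / m - auc**2 * (m + n) / (m * n)

def se_or_from_se_auc(se_auc, odds_ratio):
    # Rearranged delta method: SE(OR) = SE(AUC)(OR-1)^3 / [(OR+1)ln(OR) - 2(OR-1)]
    num = se_auc * (odds_ratio - 1.0) ** 3
    den = (odds_ratio + 1.0) * math.log(odds_ratio) - 2.0 * (odds_ratio - 1.0)
    return num / den

# Christensen et al.: AUC = 0.803, 78 deaths, 151 survivors, OR ~ 8.0
se_auc = math.sqrt(var_auc_zhou(0.803, m=78, n=151))
print(round(se_auc, 4))                          # SE(AUC) of about 0.033
print(round(se_or_from_se_auc(se_auc, 8.0), 2))  # close to the 2.37 in Table 1
```

The small remaining gap from the tabulated 2.37 is consistent with the rounding of the tabulated OR and AUC values.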
The summary AUC for these studies was 0.85 (SE = 0.03) for physician predictions compared to 0.63 (SE = 0.06) for scoring system predictions, P = 0.002. The physicians' summary OR derived from the AUC was significantly higher (12.43; 95% CI 5.47, 27.11) than the scoring systems' summary OR (2.25; 95% CI 0.78, 6.52), P = 0.001. The combined results of all 12 studies indicated that physicians predict mortality more accurately than scoring systems: the ratio of ORs (with 95% CI) was 1.92 (1.19, 3.08), P = 0.007.

Table 1. Summary data from five studies using ROC methods to report accuracy of mortality prediction

Study                    Sample size   Deaths   Prediction   AUC     OR     SE(OR)
Christensen et al.       229           78       MD           0.803   8.0    2.37
                                                Scoring      0.706   3.7    0.99
Garrouste-Orgeas et al.  334           86       MD           0.861   14.5   4.49
                                                Scoring      0.603   1.9    0.42
Knaus et al.             2057          993      MD           0.78    6.5    0.56
                                                Scoring      0.77    6.0    0.51
Poses et al.             183           38       MD           0.82    9.4    3.95
                                                Scoring      0.87    16.1   7.63
Scholz et al.            412           73       MD           0.84    11.4   3.59
                                                Scoring-1    0.75    5.1    1.39
                                                Scoring-2    0.72    4.1    1.07

AUC, Area Under the Curve (number of decimal places as reported in original papers); OR, odds ratio; SE, standard error.

5. Discussion

This paper has provided an accessible method to convert diagnostic or predictive studies that report their results in terms of the AUC under the ROC curve into corresponding values of the OR. This approach allows all studies to be incorporated in a meta-analysis using the same metric, which in turn allows investigators to more easily explore issues such as heterogeneity. Approaches to analysis using the OR will be familiar to researchers who have conducted meta-analyses of therapeutic interventions. Of note, the Hanley and McNeil [3] formula is exact for nonparametric data, and it is very robust for various parametric distributions of data, including the normal, the gamma, and the negative exponential distributions. Hanley and McNeil [3] also indicate that the distribution of the AUC estimate will not display significant skewness as long as the number of misclassified pairs (conceptually, the pairs are made up of individuals with and without events) is at least five. The worst-case scenario for our example is exemplified by the study of Poses et al. [10], which has the smallest total sample size and the smallest number of deaths. The expected number of misclassified pairs for the Poses et al. study is approximately 12, well above the Hanley and McNeil threshold. We therefore conclude that it is reasonable to suppose the sampling distributions of the AUC estimates in our component studies will be reasonably symmetric. The demonstrated first-order accuracy of the Hanley and McNeil [3] formula and the apparent symmetry of the AUC sampling distributions in our example imply that the delta method we have used will also be highly accurate to the first order with sample sizes such as we encountered in our example.
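As the Results note, pooling is carried out on the ln(OR) scale. A generic inverse-variance fixed-effect pooling of, for example, the physician (MD) rows of Table 1 can be sketched as follows. This is illustrative only: the paper's reported summary estimates were derived from the summary AUCs rather than by this exact computation, and SE(ln OR) ≈ SE(OR)/OR is itself a delta-method step:

```python
import math

# Physician (MD) rows of Table 1: (OR, SE(OR))
studies = [(8.0, 2.37), (14.5, 4.49), (6.5, 0.56), (9.4, 3.95), (11.4, 3.59)]

# Work on the log scale; the delta method gives SE(ln OR) = SE(OR)/OR
log_or = [math.log(o) for o, _ in studies]
se_log = [s / o for o, s in studies]

# Inverse-variance fixed-effect pooling of the log odds ratios
weights = [1.0 / se**2 for se in se_log]
pooled_log = sum(w * y for w, y in zip(weights, log_or)) / sum(weights)
pooled_se = 1.0 / math.sqrt(sum(weights))

pooled_or = math.exp(pooled_log)
ci = (math.exp(pooled_log - 1.96 * pooled_se), math.exp(pooled_log + 1.96 * pooled_se))
print(f"pooled OR {pooled_or:.2f}, 95% CI {ci[0]:.2f} to {ci[1]:.2f}")
```

As expected with inverse-variance weights, the large Knaus et al. study dominates the pooled estimate; a random-effects model would relax that dominance under heterogeneity.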
For particularly small sample sizes, the accuracy of the delta method might need to be examined by Monte Carlo simulation. Inconsistent reporting of results is a very common problem in the medical literature regarding meta-analyses of diagnostic tests. To date, the medical literature includes 167 meta-analyses of evaluations of diagnostic tests (ranging from Doppler echocardiography and PSA tests to various diagnostic tests in the cancer literature) using ROCs to pool the data. More than half of these meta-analyses describe significant limitations due to (1) lack of reporting of raw sensitivity and specificity data to construct ROC curves, and (2) the use of different metrics, which limits the authors' abilities to pool all of the data. Our paper provides authors with a method to use AUC data (rather than sensitivity and specificity values), which are commonly presented in studies of diagnostic tests. The meta-analyst should be concerned about the possibility of asymmetry in the ROC curves from his/her component studies. However, as we noted earlier, the AUC is very stable and robust to asymmetry. In fact, the AUC changes by less than 2% for practical cases of interest, and the impact of asymmetry on the SE of the AUC is also rather small (see Ref. [7] for details). Furthermore, SE(AUC) declines with increasing asymmetry, so assuming the symmetric value of the SE will be conservative. With these points in mind, we feel that it is reasonable to use the AUC approach and its SE from the symmetric case. A practical consideration is that many authors do not report the values of the regression coefficients, their SEs, and their covariance, all of which are required to estimate SE(AUC) in the general case. Hence, there is then no alternative except to use the implied symmetric values of AUC and its SE. Strong asymmetry in an ROC curve would, however, be a concern, and indeed in such cases it may be undesirable even to attempt a meta-analysis with such studies.
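A Monte Carlo check of the kind suggested above is straightforward to sketch: draw OR estimates around a true value, push each through equation (1), and compare the simulated spread of the resulting AUCs with the delta-method prediction. The choices below (normal sampling, the particular OR and SE) are our own illustrative assumptions, not the authors' simulation design:

```python
import math
import random

def auc_from_or(odds_ratio):
    # Equation (1): AUC = OR[(OR - 1) - ln(OR)] / (OR - 1)^2
    return odds_ratio * ((odds_ratio - 1.0) - math.log(odds_ratio)) / (odds_ratio - 1.0) ** 2

def se_auc_delta(odds_ratio, se_or):
    # Delta method: SE(AUC) = [(OR+1)ln(OR) - 2(OR-1)] / (OR-1)^3 * SE(OR)
    deriv = ((odds_ratio + 1.0) * math.log(odds_ratio)
             - 2.0 * (odds_ratio - 1.0)) / (odds_ratio - 1.0) ** 3
    return deriv * se_or

random.seed(0)
true_or, se_or = 8.0, 1.0  # a moderate SE, where the linearization should behave well
draws = [auc_from_or(max(1.0001, random.gauss(true_or, se_or))) for _ in range(20000)]

mean = sum(draws) / len(draws)
sd = math.sqrt(sum((x - mean) ** 2 for x in draws) / (len(draws) - 1))
print(f"simulated SD {sd:.4f} vs delta-method SE {se_auc_delta(true_or, se_or):.4f}")
```

With a larger SE(OR) the curvature of equation (1) makes the two values drift apart, which is exactly the small-sample situation in which the text recommends running such a simulation.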
Nevertheless, to the extent that one trusts the AUC as a summary measure of the data (which, as noted above, seems reasonable in both symmetric and moderately asymmetric situations), one can proceed. It is only when the asymmetry is extreme that the method would break down. Prior to conducting the meta-analysis in our primary study [4], we assessed the symmetry of the ROC curves whenever this information was available. There was at most only modest asymmetry of the ROC curves in the individual studies. Hence, we determined that it would be methodologically appropriate to conduct a meta-analysis. The conversion demonstrated in this paper permits the meta-analyst to overcome some of the difficulties implied by incomplete and inconsistent reporting of diagnostic or predictive studies. However, it should be recognized that this is still only a partial substitute for a more detailed analysis using individual patient data from all component studies.

Acknowledgments

The authors thank Drs. Neill Adhikari, Deborah Cook, Holger Schünemann, Lauren Griffith, and Graeme Rocker for their advice and contributions to the statistical analysis of the original systematic review. The work was partly supported by funding from the Natural Sciences and Engineering Research Council. Dr. Sinuff is supported by a CIHR Clinician Scientist award.

References

[1] Morris SB, DeShon RP. Combining effect size estimates in meta-analysis with repeated measures and independent-groups designs. Psychol Methods 2002;7:105–25.
[2] Deeks JJ, Altman DG, Bradburn MJ. Statistical methods for examining heterogeneity and combining results from several studies in meta-analysis. In: Egger M, Smith GD, Altman DG, editors. Systematic reviews in health care: meta-analysis in context. 2nd ed. London: BMJ Books; 2001. p. 285–312.

[3] Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982;143:29–36.
[4] Sinuff T, Adhikari NKJ, Cook DJ, Schünemann HJ, Griffith LE, Rocker G, et al. Mortality predictions in the intensive care unit: comparing physicians to scoring systems. Crit Care Med 2006;34:878–85.
[5] Moses LE, Shapiro D, Littenberg B. Combining independent studies of a diagnostic test into a summary ROC curve: data-analytic approaches and some additional considerations. Stat Med 1993;12:1293–316.
[6] Littenberg B, Moses LE. Estimating accuracy from multiple conflicting reports: a new analytic method. Med Decis Making 1993;13:313–21.
[7] Walter SD. Properties of the summary receiver operating characteristic (SROC) curve for diagnostic test data. Stat Med 2002;21:1237–56.
[8] Zhou XA, Obuchowski NA, McClish DK. Statistical methods in diagnostic medicine. New York, NY: John Wiley & Sons, Inc.; 2002.
[9] Christensen C, Cottrell JJ, Murakami J, Mackesy ME, Fetzer B, Elstein AS, et al. Forecasting survival in the medical intensive care unit: a comparison of clinical prognoses with formal estimates. Methods Inf Med 1993;32:302–8.
[10] Poses RM, McClish DK, Bekes C, Scott WE, Morely JN. Ego bias, reverse ego bias, and physicians' prognostic judgment. Crit Care Med 1991;19:1533–9.
[11] Knaus WA, Harrell FE Jr, Lynn J, Goldman L, Phillips RS, Connors AF, et al. The SUPPORT prognostic model. Objective estimates of survival for seriously ill hospitalized adults. Ann Intern Med 1995;122:191–203.
[12] Garrouste-Orgeas M, Montuclard L, Timsit J-F, Misset B, Christias M, Carlet J. Triaging patients to the ICU: a pilot study of factors influencing admission decisions and patient outcomes. Intensive Care Med 2003;29:774–81.
[13] Scholz N, Basler K, Saur P, Burchardi H, Felder S. Outcome prediction in critical care: physicians' prognoses vs scoring systems.
Eur J Anaesthesiol 2004;21:606e11. [14] Armitage P, Berry G, Matthews JNS, editors. Statistical methods in medical research. 4th ed. Oxford: Blackwell Science; 2002. [15] Brannen AL II, Godfrey LJ, Goetter WE. Prediction of outcome from critical illness. A comparison of clinical judgment with a prediction rule. Arch Intern Med 1999;149:1083e6. [16] Copeland-Fields L, Griffin T, Jenkins T, Buckley M, Wise LC. Comparison of outcome predictions made by physicians, by nurses, and by using the mortality prediction model. Am J Crit Care 2001;10:313e9. [17] Meyer AA, Messick WJ, Young P, Baker CC, Fakhry S, Muakkassa F, et al. Prospective comparison of clinical judgment and APACHE II score in predicting the outcome in critically ill surgical patients. J Trauma 1992;32:747e54. [18] Marks RJ, Simons RS, Blizzard RA, Browne DRG. Predicting outcome in intensive therapy unitsda comparison of APACHE II with subjective assessment. Intensive Care Med 1991;17:159e63. [19] Chang RWS, Lee B, Jacobs S, Lee B. Accuracy of decisions to withdraw therapy in critically ill patients: clinical judgment versus a computer model. Crit Care Med 1989;12:1091e7. [20] Kruse JA, Thill-Baharozian MC, Carlson RW. Comparison of clinical assessment with APACHE II for predicting mortality risk in patients admitted to a medical intensive care unit. JAMA 1988;260: 1739e42. [21] Katzman-McClish D, Powell SH. How well can physicians estimate mortality in a medical intensive care unit? Med Decis Making 1989;9:125e32.