Comparison of the EuroSCORE II and Society of Thoracic Surgeons 2008 risk tools


European Journal of Cardio-Thoracic Surgery 44 (2013) 999–1005
doi:10.1093/ejcts/ezt122
Advance Access publication 4 March 2013

ORIGINAL ARTICLE

Bilal H. Kirmani, Khurum Mazhar, Brian M. Fabri and D. Mark Pullan*
Department of Cardiac Surgery, Liverpool Heart and Chest Hospital, Liverpool, UK

* Corresponding author. Department of Cardiothoracic Surgery, Liverpool Heart and Chest Hospital, Thomas Drive, Liverpool L13 3PE, UK. Tel: +44-151-6001397; fax: +44-151-2932254; e-mail: mark.pullan@lhch.nhs.uk (D.M. Pullan).

Received 16 September 2012; received in revised form 22 January 2013; accepted 3 February 2013

Abstract

OBJECTIVES: Risk stratification in cardiac surgery is uniquely detailed, led latterly by the EuroSCORE and the Society of Thoracic Surgeons (STS) risk calculators. The recently published EuroSCORE II (ES2) algorithms update estimated mortality in a broad spectrum of cardiac procedures. The 2008 STS tool, in comparison, predicts multiple outcomes for specific procedures. We sought to identify and compare the external validity of both contemporaneous tools in our population.

METHODS: Data from our hospital database were collated for the period February 2001 to March 2010. Logistic regression coefficients from the risk calculations were applied to the data and the results presented as receiver-operating characteristic (ROC) curves. Statistical analyses were performed using the area under the ROC curve (AUROC) and the Hosmer–Lemeshow (H-L) goodness-of-fit test, with comparisons using the DeLong method.

RESULTS: A total of 15 497 procedures were identified, of which 14 432 were appropriate for STS risk scoring (i.e. valve and/or graft procedures with no tricuspid valve operations etc.). For all procedures, ES2 and STS were equivalent (AUROC 0.818 vs 0.805, respectively, P = 0.343).
For procedures appropriate for STS risk scoring, results were similar (AUROC ES2 vs STS, 0.816 vs 0.810, P = 0.714), whereas for procedures excluded by STS, the results were marginally worse (AUROC ES2 vs STS, 0.773 vs 0.784, P = 0.751). Goodness of fit in all cases was poor, primarily where risk was higher than 15% (H-L P < 0.0001).

CONCLUSIONS: EuroSCORE II and STS both provide equivalent discrimination in predicting mortality in a British population, including those undergoing procedures for which the STS does not normally predict. Allowing for the fact that decile-grouped Hosmer–Lemeshow tests are not ideal for the assessment of calibration, both tools show good calibration for patients at low to moderate risk, with divergence from 15% predicted risk upwards.

Keywords: Statistics Risk analysis/modelling Surgery Complications Sternum Infection

Presented at the 26th Annual Meeting of the European Association for Cardio-Thoracic Surgery, Barcelona, Spain, 27–31 October 2012.

© The Author 2013. Published by Oxford University Press on behalf of the European Association for Cardio-Thoracic Surgery. All rights reserved.

INTRODUCTION

Twelve years ago, Geissler et al. [1] published a comparison of six contemporaneous cardiac surgical risk scores, concluding at the time that the then-new Euro score [sic] yielded the best predictive value for mortality. In the intervening decade, two risk calculators have come to predominate: the Society of Thoracic Surgeons (STS) Risk Score in North America and the EuroSCORE in Europe. The former had not been developed at the time of Geissler's review, while the EuroSCORE risk calculator [2] has continued to demonstrate itself as a well-established and validated tool [3, 4].

Despite evidence that the EuroSCORE model retained excellent discrimination over more than a decade, EuroSCORE was renewed earlier this year with modifications, based on logistic regression analyses of 23 000 patients in 150 hospitals in 43 countries [5]. EuroSCORE II retains the parsimonious approach of its predecessor, with 18 variables making up the risk score, demonstrating the authors' priority of clinical usability when developing the score [6].

The Society of Thoracic Surgeons Risk Score was most recently updated in 2008 [7–10] from a multicentre database of over 100 000 index procedures in the United States. Using 67 demographic and operative parameters, the STS Risk Tool calculates predicted mortality for coronary artery bypass grafting, valve procedures and combined cases. In addition, the extensive calculator estimates the risk of a number of other morbidities, including renal failure, cerebrovascular accident, deep sternal wound infection and prolonged ITU stay.

Since publication, the EuroSCORE II has been validated in large, European, multicentre cohorts of both historical patients and those operated on since the data used to design the calculator were collected. Barili et al. [11] found that the performance of the tool was more than satisfactory, and well calibrated for a large proportion of patients, although in the higher tertiles of risk it did not improve on the original additive or logistic EuroSCORE. Grant et al. [12] described it as an acceptable generic cardiac surgery risk model, but found that it was poorly calibrated for the highest and lowest risk patients.

To date, a direct comparison of the two most recent and commonly used risk calculators worldwide has not been made. We sought to establish the validity and predictive powers of the EuroSCORE II and the 2008 Society of Thoracic Surgeons risk scores in a large cohort of patients undergoing a variety of cardiac surgical procedures.

MATERIALS AND METHODS

The study was approved as an audit by the local ethics committee and patient consent was waived.

Table 1: Demographic data

Characteristic | All patients (N = 15 497) | CABG and valve (N = 1858) | CABG only (N = 9559) | Valve only (N = 4080)
Male, n (%) | 11 217 (72.4) | 1273 (68.5) | 7689 (80.4) | 2255 (55.3)
Age, years ± SD | 65.3 ± 11.0 | 71.0 ± 8.4 | 65.1 ± 9.1 | 63.0 ± 14.5
BMI, kg/m² ± SD | 28.1 ± 4.8 | 27.6 ± 4.6 | 28.6 ± 4.5 | 27.1 ± 5.2
Creatinine, μmol/l ± SD | 101.5 ± 54.8 | 107.6 ± 56.0 | 100.2 ± 49.8 | 101.9 ± 64.3
Renal failure, n (%) | | | |
  ARI | 117 (0.7) | 15 (0.8) | 36 (0.4) | 66 (1.6)
  CRI | 806 (5.2) | 147 (7.9) | 437 (4.6) | 222 (5.5)
  CRF | 98 (0.6) | 16 (0.9) | 47 (0.5) | 35 (0.9)
Ejection fraction, n (%) | | | |
  Good | 9566 (61.8) | 1112 (59.9) | 5712 (59.8) | 2742 (67.6)
  Moderate | 4335 (28.0) | 522 (28.1) | 2930 (30.7) | 883 (21.8)
  Poor | 1346 (8.7) | 211 (11.4) | 873 (9.1) | 262 (6.5)
Atrial fibrillation | 1604 (10.4) | 345 (18.6) | 343 (3.6) | 916 (22.5)
NYHA class, n (%) | | | |
  I | 3884 (25.1) | 197 (10.6) | 2897 (30.3) | 790 (19.4)
  II | 5931 (38.3) | 625 (33.7) | 4223 (44.2) | 1083 (26.6)
  III | 5024 (32.4) | 904 (48.7) | 2297 (24.0) | 1823 (44.8)
  IV | 647 (4.2) | 131 (7.1) | 140 (1.5) | 376 (9.2)
CCS class, n (%) | | | |
  I | 1032 (6.7) | 154 (8.3) | 622 (6.7) | 236 (5.8)
  II | 3538 (22.8) | 437 (25.5) | 2703 (28.3) | 362 (8.9)
  III | 4136 (26.7) | 462 (24.9) | 3508 (36.7) | 166 (4.1)
  IV | 1804 (2.3) | 139 (7.5) | 1624 (17.0) | 41 (1.0)
Unstable angina | 2268 (14.6) | 197 (10.6) | 2011 (21.0) | 60 (1.5)
Chronic airways disease | 1253 (8.1) | 208 (11.2) | 719 (7.5) | 326 (8.0)
PVD | 1903 (12.3) | 277 (14.9) | 1368 (14.3) | 258 (6.3)
Diabetes mellitus, n (%) | | | |
  Diet controlled | 633 (4.1) | 92 (4.9) | 429 (4.5) | 112 (2.8)
  Oral meds | 1595 (10.3) | 192 (10.3) | 1156 (12.1) | 247 (6.1)
  Insulin | 800 (5.2) | 83 (4.5) | 624 (6.5) | 93 (2.3)
Coronary artery disease extent, n (%) | | | |
  1-vessel | 1123 (9.1) | 637 (34.5) | 334 (3.5) | 152 (16.7)
  2-vessel | 2428 (19.7) | 537 (29.1) | 1870 (19.6) | 21 (2.3)
  3-vessel | 8076 (65.6) | 666 (36.1) | 7344 (76.9) | 66 (7.2)
Pre-op IABP, n (%) | 244 (1.6) | 33 (1.8) | 185 (1.9) | 26 (0.6)
Pre-op shock, n (%) | 99 (0.6) | 15 (0.8) | 42 (0.4) | 42 (1.0)
Hypertension, n (%) | 9073 (58.5) | 1142 (61.5) | 6193 (64.8) | 1738 (42.6)
Left main stem disease, n (%) | 2475 (16.0) | 198 (10.7) | 2270 (23.7) | 7 (0.2)
Q-wave MI, n (%) | 5581 (36.0) | 555 (30.0) | 4784 (50.0) | 242 (5.9)
First operation, n (%) | 14 377 (92.8) | 1730 (93.2) | 9192 (96.2) | 3455 (84.8)
Status, n (%) | | | |
  Urgent | 2555 (16.5) | 286 (15.4) | 1791 (18.7) | 478 (11.7)
  Emergent/salvage | 346 (2.2) | 21 (1.1) | 130 (1.3) | 195 (4.7)
Smoking, n (%) | | | |
  Current | 1967 (12.7) | 180 (9.7) | 1308 (13.7) | 479 (11.8)
  Ex-smoker | 8346 (53.9) | 1088 (58.6) | 5524 (57.8) | 1734 (42.7)
Cerebrovascular disease, n (%) | 1237 (8.0) | 188 (10.1) | 718 (7.5) | 331 (8.1)
Ventilation pre-op, n (%) | 54 (0.3) | 7 (0.4) | 14 (0.1) | 33 (0.8)
Inotropes pre-op, n (%) | | | |
  1 | 611 (4.5) | 100 (6.2) | 297 (3.5) | 214 (5.9)
  2 | 278 (2.0) | 64 (4.0) | 113 (1.3) | 101 (2.8)
  3 | 82 (0.6) | 27 (1.7) | 20 (0.2) | 35 (1.0)
FEV1, l ± SD | 79.8 ± 29.3 | 78.2 ± 27.5 | 82.4 ± 28.5 | 74.3 ± 31.0

CABG: coronary artery bypass grafting; BMI: body mass index; ARI: acute renal injury; CRI: chronic renal injury; CRF: chronic renal failure; NYHA: New York Heart Association; CCS: Canadian Cardiovascular Society; PVD: peripheral vascular disease; IABP: intra-aortic balloon pump; MI: myocardial infarction; FEV1: forced expiratory volume in 1 s.
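Both risk tools are logistic regression models, so scoring a patient record amounts to adding up the coefficients of the risk factors present and passing the total through the logistic function. The sketch below illustrates only that step. The intercept, factor names and weights are hypothetical placeholders chosen for illustration; they are not the published EuroSCORE II or STS 2008 coefficients, and the function name is likewise an assumption.

```python
import math

# Hypothetical values for illustration only; NOT published coefficients.
INTERCEPT = -5.0
COEFFICIENTS = {
    "female": 0.2,
    "poor_ejection_fraction": 0.8,
    "urgent_operation": 0.3,
}


def predicted_mortality(risk_factors):
    """Map a dict of {factor name: value} to a predicted mortality in (0, 1).

    Factors missing from the dict contribute nothing, which mirrors the
    convention of treating an unrecorded categorical risk factor as absent.
    """
    unknown = set(risk_factors) - set(COEFFICIENTS)
    if unknown:
        raise KeyError(f"no coefficient defined for: {sorted(unknown)}")
    linear_predictor = INTERCEPT + sum(
        COEFFICIENTS[name] * float(value) for name, value in risk_factors.items()
    )
    # Logistic (inverse logit) transform of the linear predictor.
    return 1.0 / (1.0 + math.exp(-linear_predictor))


if __name__ == "__main__":
    baseline = predicted_mortality({})
    higher_risk = predicted_mortality(
        {"female": 1, "poor_ejection_fraction": 1, "urgent_operation": 1}
    )
    print(f"baseline: {baseline:.4f}, with three factors: {higher_risk:.4f}")
```

Because the transform is monotonic, every additional positive coefficient raises the predicted risk, but by an amount that depends on the patient's other factors rather than by a fixed increment.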

Figure 1: Receiver operator curve for EuroSCORE II in all cases, AUROC: 0.818.

Figure 2: Receiver operator curve for STS 2008 in all cases, AUROC: 0.805.

Study design and population

Data from our cardiac surgery database at the Liverpool Heart and Chest Hospital, UK, were gathered for all patients undergoing procedures between 1 February 2001 and 31 March 2010, inclusive. The database is accredited by the Society for Cardiothoracic Surgery in Great Britain and Ireland. Our institution contributed data to the EuroSCORE II project from May to July 2010 and, therefore, not during the period relevant to this paper.

During the study period, a total of 15 499 cardiac operations with an index valve, graft or combined procedure were performed. After excluding those procedures that were not applicable for risk stratification by the Society of Thoracic Surgeons risk score (e.g. tricuspid valve procedures, other non-cardiac procedures, left ventricular aneurysm repair, ventricular or atrial septal defect repairs, cardiac trauma, AF-ablation surgery, aortic aneurysm procedures etc.), a total of 14 432 procedures remained. These included 9477 coronary artery bypass grafts, 2964 valve procedures and 1726 combined valve and graft operations.

Figure 3: Observed vs expected mortality using EuroSCORE II for (A) all cases; (B) STS-applicable cases; and (C) STS not-applicable cases. Solid line: expected (predicted) mortality; points and vertical bars: mean and 95% confidence interval of observed mortality for each decile in the Hosmer–Lemeshow test.

Demographic information for the whole population is summarized in Table 1, and included all risk factors described by the Society of Thoracic Surgeons 2008 risk score with the exception of ethnic origin, which was not routinely collected. All factors for calculating EuroSCORE II risks were available, with the exception of poor mobility due to musculoskeletal or neurological dysfunction.

The various algorithms for calculating risk were applied to these records, to calculate EuroSCORE II and STS 2008 risk scores for each patient. For the subgroup not applicable to STS risk scoring, we applied the same logistic regression functions for each parameter from the index operation (valve, graft or valve and graft). Where a categorical variable was missing, we assumed a null value, i.e. that the disease state was absent. Where a continuous variable was missing, we applied the algorithm for imputed missing data dictated by the STS calculator.

Definitions

Where differences existed in the standard units of measurement between the USA and the UK (e.g. serum creatinine), appropriate conversions were applied for calculations, but the data presented here are in British units. Quantitative measurements that referred to qualitative descriptors in the STS score (i.e. chronic lung disease, which is stratified as mild, moderate or severe in the STS risk variables but was measured at our centre as forced expiratory volume in 1 s) were graded according to published criteria [13].

Statistical analysis

Statistical tests were performed using JMP 9.0.2 for Mac (SAS Institute, Inc., Cary, NC, USA) and R software (R Core Team, Vienna, Austria) [14]. The receiver-operating characteristic (ROC) curve was employed to test the discrimination of each model using the area under the ROC curve (AUROC) or C-statistic. The calibration of the model was interrogated using the Hosmer–Lemeshow (H-L) test for goodness of fit using a standard decile format. Comparison between ROC curves was performed using DeLong's method [15, 16].

RESULTS

All procedures

For all 15 497 procedures undertaken at our institution, the overall in-hospital mortality was 547 patients, or 3.5% (95% CI: 3.3–3.8). The EuroSCORE II Score ranged from 0.48 to 74.80% (mean 2.53, 95% CI: 2.48–2.59). The AUROC was 0.818, with an H-L goodness-of-fit statistic P value of <0.001. The STS Risk Score ranged from 0.20 to 72.78% (mean 2.39, 95% CI: 2.34–2.45). The AUROC for risk of perioperative death was 0.805, with an H-L goodness-of-fit statistic P value of <0.001. Comparison between the two scoring systems by DeLong's method showed no significant difference (P = 0.343).

Society of Thoracic Surgeons-applicable procedures

For 14 432 patients in whom the STS score could be applied, the overall mortality in hospital was 447 or 3.1% (95% CI: 2.8–3.4). The EuroSCORE II Score ranged from 0.48 to 74.80% (mean 2.44, 95% CI: 2.39–2.50). The AUROC was 0.816, with an H-L goodness-of-fit statistic P value of <0.001. The STS Risk Score ranged from 0.20 to 72.78% (mean 2.40, 95% CI: 2.35–2.45). The AUROC for risk of perioperative death was 0.810, with an H-L goodness-of-fit statistic P value of <0.001. Comparison between the two scoring systems showed no significant difference (P = 0.714).

Figure 4: Observed vs expected outcome using STS 2008 for (A) all cases; (B) STS-applicable cases; and (C) STS not-applicable cases. Solid line: expected (predicted) mortality; points and vertical bars: mean and 95% confidence interval of observed mortality for each decile in the Hosmer–Lemeshow test.
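The AUROC (C-statistic) reported for each tool can be read as the probability that a randomly chosen patient who died was assigned a higher predicted risk than a randomly chosen survivor. The following is a minimal sketch of that rank-based calculation in plain Python. It is not the authors' analysis, which used JMP and R; the function name and the toy data are assumptions made for illustration.

```python
def auroc(predicted_risks, died):
    """Area under the ROC curve via the Mann-Whitney rank-sum identity.

    predicted_risks: predicted mortality for each patient.
    died: matching sequence of truthy (died) / falsy (survived) outcomes.
    Tied predictions receive the average of the ranks they span, which is
    equivalent to counting a tied death/survivor pair as half a success.
    """
    pairs = sorted(zip(predicted_risks, died), key=lambda pair: pair[0])
    n = len(pairs)
    n_deaths = sum(1 for _, outcome in pairs if outcome)
    n_survivors = n - n_deaths
    if n_deaths == 0 or n_survivors == 0:
        raise ValueError("AUROC needs at least one death and one survivor")

    rank_sum_deaths = 0.0
    start = 0
    while start < n:
        end = start
        while end < n and pairs[end][0] == pairs[start][0]:
            end += 1
        # 1-based ranks start+1 .. end share the average rank below.
        average_rank = (start + 1 + end) / 2.0
        deaths_in_block = sum(1 for k in range(start, end) if pairs[k][1])
        rank_sum_deaths += average_rank * deaths_in_block
        start = end

    u_statistic = rank_sum_deaths - n_deaths * (n_deaths + 1) / 2.0
    return u_statistic / (n_deaths * n_survivors)


if __name__ == "__main__":
    # Made-up toy data, not taken from the study.
    risks = [0.01, 0.02, 0.02, 0.05, 0.10, 0.30]
    outcomes = [0, 0, 1, 0, 1, 1]
    print(auroc(risks, outcomes))
```

A value of 0.5 corresponds to a score that ranks patients no better than chance and 1.0 to perfect separation. The statistic depends only on the ordering of predicted risks, which is why a tool can discriminate well while still being poorly calibrated.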

Table 2: Example Hosmer–Lemeshow deciles table (EuroSCORE II for all cases)

Group | Probability interval | Total | Mortality observed | Mortality expected | Survival observed | Survival expected
1 | 0.0048–0.0066 | 1614 | 4 | 9.2 | 1610.0 | 1604.8
2 | 0.0066–0.0080 | 1475 | 7 | 11.0 | 1468.0 | 1464.0
3 | 0.0080–0.0094 | 1537 | 9 | 14.1 | 1528.0 | 1522.9
4 | 0.0094–0.0112 | 1541 | 10 | 17.1 | 1531.0 | 1523.9
5 | 0.0112–0.0133 | 1541 | 14 | 20.6 | 1527.0 | 1520.4
6 | 0.0133–0.0161 | 1541 | 32 | 25.4 | 1509.0 | 1515.6
7 | 0.0161–0.0200 | 1542 | 47 | 32.6 | 1495.0 | 1509.4
8 | 0.0200–0.0254 | 1542 | 69 | 42.9 | 1473.0 | 1499.1
9 | 0.0254–0.0339 | 1542 | 101 | 62.9 | 1441.0 | 1479.1
10 | 0.0339–0.7480 | 1541 | 236 | 155.2 | 1305.0 | 1385.8

Procedures not applicable to Society of Thoracic Surgeons scoring

One thousand and sixty-five patients underwent operations which the 2008 STS Risk Calculator does not claim to stratify. The mortality in this group was 9.3% (95% CI: 7.8–11.3). The EuroSCORE II Score ranged from 0.48 to 52.1% (mean 3.8, 95% CI: 3.5–4.1). The AUROC was 0.773, with an H-L goodness-of-fit statistic P value of <0.001. The STS Risk Score ranged from 0.3 to 21.3% (mean 2.3, 95% CI: 2.1–2.5). The AUROC was 0.784, with an H-L goodness-of-fit statistic P value of <0.001. The difference between the two AUROCs was not statistically significant (P = 0.751).

Graphical representations of these results are shown in Figs 1, 2, 3 and 4.

COMMENT

The discrimination of both the EuroSCORE II and STS 2008 risk stratification calculators in our institution was moderate to good, with AUROC values between 0.77 and 0.82 for all patients, including those not normally stratified by the STS score. The calibration of both tools was, however, inferred to be consistently poor, with H-L goodness-of-fit tests being statistically significant in all cases (P < 0.001). As Sergeant et al. noted, however, a significant H-L statistic does not necessarily mean that a model is not well calibrated [17], and early papers citing the superiority of the original EuroSCORE over other risk stratification tools did not employ goodness-of-fit tests [3]. Indeed, even in the original publication of the EuroSCORE II, the validation of the model demonstrated a goodness of fit that was only just non-significant at P = 0.0505 when applied to the whole dataset. This was later shown to be P = 0.09316 in the validation subgroup [18].

The H-L test performs a χ² test comparing the observed and expected outcomes for deciles of each test group (Table 2). Any divergence of the observed and expected outcomes leads to a statistically significant difference, whether the divergence is consistent across deciles or confined to a single averaged decile. At higher predicted mortalities, both models tend to under-predict (Figs 3 and 4). The H-L test may, however, be poorly equipped to deal with the weighted, non-parametric distribution of risks of a large test population. Decile subgrouping of several thousand patients with narrow error at the left end of the distribution leads to averaging errors in the highest risk group. This is illustrated in the probability intervals of the decile groups for the H-L test (Table 2). The first nine deciles, i.e. 90% of the patients, have a predicted risk of <5%. The final decile represents around 1500 of the highest risk patients, ranging from 3.3 to 74.8%, where around a third of these were predicted to be at risk of >10%, and only 5 patients had a predicted mortality of >50%.

Figure 5: Observed vs expected outcomes for EuroSCORE II using decile and percentile grouping. Solid line: expected (predicted) mortality; points and vertical bars: mean and 95% confidence interval of observed mortality for each group in the Hosmer–Lemeshow test.

Figure 6: Observed vs expected outcomes for STS 2008 using (A) decile and (B) percentile grouping. Solid line: expected (predicted) mortality; points and vertical bars: mean and 95% confidence interval of observed mortality for each group in the Hosmer–Lemeshow test.

When the population is instead distributed over 100 subgroups (i.e. percentiles instead of deciles), however, these averaging errors are minimized, and the observed and expected curves for both risk calculators show more coherence (Figs 5 and 6). Divergence still occurs, but begins at 10–15% risk rather than 2–3%, as comparisons are made between patients in similar risk strata. At this risk level, the range of expected mortality increases due to a larger variability of comorbidities in the risk groups and smaller numbers. The strength of the calculators is presumably also lower here because of the relatively small number of high-risk patients in the modelling populations.

The conclusions drawn from the results of the H-L tests must, therefore, be guarded, and a clear distinction must be made between the ability of a model to discriminate between high- and low-risk patients, which both tools do effectively, and the failure of a model to calibrate through all risk stratifications. The observed vs expected mortality graphs show that both calculators distinguish well between patients at different risk, even if the precise quantification of that risk is imperfect. Much has already been written about the use of the H-L tool in prediction validation [6, 12, 17, 19–22], and we retain our own reservations about the statistical methods employed to demonstrate poor calibration of the calculators. This view was recently corroborated by one of the original authors of the H-L test, with statistical proof that the decile-based test is not appropriate for large populations [23].

The STS risk score for morbidity is not frequently utilized in the British setting owing to the popularity that the original additive EuroSCORE derived from its 'back of the envelope' simplicity. The logistic EuroSCORE maintained the interest of early adopters as it utilized the same, limited parameters, and both the additive and logistic EuroSCORE I were easy and quick to calculate and validate, and allowed accurate risk prediction. With EuroSCORE II, the model calibration continues to improve with a parsimonious array of factors that maintain the original calculator's usability.

Critics of both EuroSCORE and STS calculators point out that certain risk factors are unaccounted for and that the interaction between risk factors is not always represented. The STS model does represent a number of select interactions. Our study suggests that careful choice of a limited number of independent risk factors can provide as effective a risk model as one that includes calculations for multiple interactions. Until the advent of more advanced, accurate and instantly available prediction models, the assessment between two equally discriminating tools falls in favour of the one that is easier to use.

Conflict of interest: none declared.

REFERENCES

[1] Geissler HJ, Hölzl P, Marohl S, Kuhn-Régnier F, Mehlhorn U, Südkamp M et al. Risk stratification in heart surgery: comparison of six score systems. Eur J Cardiothorac Surg 2000;17:400–6.
[2] Nashef SAM, Roques F, Michel P, Gauducheau E, Lemeshow S, Salamon R, the EuroSCORE study group. European system for cardiac operative risk evaluation (EuroSCORE). Eur J Cardiothorac Surg 1999;16:9–13.
[3] Gogbashian A, Sedrakyan A, Treasure T. EuroSCORE: a systematic review of international performance. Eur J Cardiothorac Surg 2004;25:695–700.
[4] Nashef SAM, Roques F, Hammill BG, Peterson ED, Michel P, Grover FL et al. Validation of European System for Cardiac Operative Risk Evaluation (EuroSCORE) in North American cardiac surgery. Eur J Cardiothorac Surg 2002;22:101–5.
[5] Nashef SAM, Roques F, Sharples LD, Nilsson J, Smith C, Goldstone AR et al. EuroSCORE II. Eur J Cardiothorac Surg 2012;41:734–45.
[6] Nashef SAM, Sharples LD, Roques F, Lockowandt U. EuroSCORE II and the art and science of risk modelling. Eur J Cardiothorac Surg 2013;43:695–6.
[7] Shahian DM, Edwards FH. The Society of Thoracic Surgeons 2008 Cardiac Surgery Risk Models: Introduction. Ann Thorac Surg 2009;88:S1.
[8] Shahian DM, O'Brien SM, Filardo G, Ferraris VA, Haan CK, Rich JB et al. The Society of Thoracic Surgeons 2008 Cardiac Surgery Risk Models: part 1, coronary artery bypass grafting surgery. Ann Thorac Surg 2009;88:S2–22.
[9] O'Brien SM, Shahian DM, Filardo G, Ferraris VA, Haan CK, Rich JB et al. The Society of Thoracic Surgeons 2008 Cardiac Surgery Risk Models: part 2, isolated valve surgery. Ann Thorac Surg 2009;88:S23–42.
[10] Shahian DM, O'Brien SM, Filardo G, Ferraris VA, Haan CK, Rich JB et al. The Society of Thoracic Surgeons 2008 Cardiac Surgery Risk Models: part 3, valve plus coronary artery bypass grafting surgery. Ann Thorac Surg 2009;88:S43–62.
[11] Barili F, Pacini D, Capo A, Rasovic O, Grossi C, Alamanni F et al. Does EuroSCORE II perform better than its original versions? A multicentre validation study. Eur Heart J 2013;34:22–9.
[12] Grant SW, Hickey GL, Dimarakis I, Trivedi U, Bryan A, Treasure T et al. How does EuroSCORE II perform in UK cardiac surgery: an analysis of 23 740 patients from the Society for Cardiothoracic Surgery in Great Britain and Ireland National Database. Heart 2012;98:1568–72.
[13] Rabe KF, Hurd S, Anzueto A, Barnes PJ, Buist SA, Calverley P et al. Global strategy for the diagnosis, management, and prevention of chronic obstructive pulmonary disease: GOLD executive summary. Am J Respir Crit Care Med 2007;176:532–55.
[14] R Development Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2011.

B.H. Kirmani et al. / European Journal of Cardio-Thoracic Surgery

[15] Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez J-C et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 2011;12:77.
[16] DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 1988;44:837–45.
[17] Sergeant P, Meuris B, Pettinari M. EuroSCORE II, illum qui est gravitates magni observe*. Eur J Cardiothorac Surg 2012;41:729–31.
[18] Sharples LD, Nashef SAM. Reply to Hickey and Bridgewater. Eur J Cardiothorac Surg 2012;43:208–9.
[19] Chalmers J, Pullan M, Fabri B, McShane J, Shaw M, Mediratta N et al. Validation of EuroSCORE II in a modern cohort of patients undergoing cardiac surgery. Eur J Cardiothorac Surg 2013;43:688–94.
[20] Hickey GL, Bridgewater B. How well calibrated is EuroSCORE II? Eur J Cardiothorac Surg 2012;43:208.
[21] Siregar S, Groenwold RHH, de Heer F, Bots ML, van der Graaf Y, van Herwerden LA. Performance of the original EuroSCORE. Eur J Cardiothorac Surg 2012;41:746–54.
[22] Yap C-H, Reid C, Yii M, Rowland MA, Mohajeri M, Skillington PD et al. Validation of the EuroSCORE model in Australia. Eur J Cardiothorac Surg 2006;29:441–6; discussion 446.
[23] Paul P, Pennell ML, Lemeshow S. Standardizing the power of the Hosmer-Lemeshow goodness of fit test in large data sets. Stat Med 2013;32:67–80.

APPENDIX. CONFERENCE DISCUSSION

Dr M. Mack (Dallas, TX, USA): As we are aware, there are now 11 risk algorithms in cardiac surgery to predict outcomes and mortality, and one of the issues is that the algorithms are only accurate in the populations in which they are studied and in the time period in which they have been done. Because of TAVR, the logistic EuroSCORE specifically has been widely overused and abused in populations for which it was not intended, calibrated or meant to analyse.
And all of these risk algorithms are, as you have alluded to, less accurate when you get to the extreme risks of populations, with logistic EuroSCORE usually over-predicting risk by a factor of three in the highest risk patients and STS slightly under-predicting risk. So I think your study demonstrates very well that although there still are problems in the extremes, we are much better than we were previously. I have three questions for you. The first is, as we heard from Mr Nashef's presentation, STS and EuroSCORE II use different definitions of mortality. What did you use for this when you looked at both; did you use different definitions of mortality, and when you reported your own mortality, which definition did you use?

Dr Kirmani: We used inpatient mortality, simply for the reasons that Mr Nashef has alluded to, in that this data is readily available.

Dr Mack: So that may be a problem because you did not use the STS definition of death.

Dr Kirmani: Absolutely.

Dr Mack: One would think that a death would be a death would be a death, i.e. that there is a common definition, but, as has been alluded to, there is not. The second question is that EuroSCORE II under-predicted the actual mortality in your institution overall. Do you think that is a problem with your programme or a problem with the risk prediction algorithm?

Dr Kirmani: I think it is difficult to tease out exactly which of those it may be. I know that the 2008 scores have since been modified with a multiplier, and over a nine-year period it would be difficult to ascertain exactly which of those factors may have contributed.

Dr Mack: And my last question is, despite Dr Sergeant's protestations in an editorial that EuroSCORE not be used for TAVR risk prediction, that it actually is abuse, already it is being widely used for this. Have you looked at TAVR in your hospital to see whether this EuroSCORE more closely predicts outcomes and is more accurate?

Dr Kirmani: We haven't done that yet.
But, as you say, whether or not the tool it was designed for is accurate for something else, I think when you are looking at similar groups of patients, what you can do is use surrogate markers, and something like a EuroSCORE for a cardiac surgical procedure may give you a ballpark figure for something similar like a TAVI.
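
The distinction drawn in the Discussion between discrimination and calibration, and the sensitivity of the decile-based H-L statistic to sample size, can be illustrated with a small simulation. The sketch below is not the authors' analysis (which was performed in R) and does not reproduce the standardization proposed in [23]; the risk distribution, the 15% under-prediction factor, the sample sizes and the function names are all hypothetical choices made for illustration. It computes the area under the ROC curve by the rank-sum method and the H-L statistic over ten equal-sized risk groups, for the same mildly miscalibrated model at two sample sizes.

```python
import random


def auc(pred, outcome):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.
    Ties in predicted risk are not handled; they are negligible here."""
    order = sorted(range(len(pred)), key=pred.__getitem__)
    rank_sum, n_pos = 0, 0
    for rank, i in enumerate(order, start=1):
        if outcome[i]:
            rank_sum += rank
            n_pos += 1
    n_neg = len(pred) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def hosmer_lemeshow(pred, outcome, groups=10):
    """Decile-based H-L statistic: sum of (O - E)^2 / (n * pbar * (1 - pbar))."""
    pairs = sorted(zip(pred, outcome))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        expected = sum(p for p, _ in chunk)   # predicted deaths in the group
        observed = sum(y for _, y in chunk)   # simulated deaths in the group
        stat += (observed - expected) ** 2 / (expected * (1 - expected / size))
    return stat


def simulate(n, factor=1.15, seed=1):
    """Hypothetical cohort: the model under-predicts true risk by `factor`."""
    rng = random.Random(seed)
    # Assumed skewed risk distribution (mean predicted risk about 3.5%).
    pred = [0.005 + 0.12 * rng.random() ** 3 for _ in range(n)]
    outcome = [1 if rng.random() < factor * p else 0 for p in pred]
    return auc(pred, outcome), hosmer_lemeshow(pred, outcome)


for n in (2000, 200000):
    a, hl = simulate(n)
    print(f"n={n}: AUC={a:.3f}, H-L statistic={hl:.1f}")
```

Under these assumptions the AUC is governed by the ranking of patients and should be similar at both sample sizes, whereas the H-L statistic for a fixed degree of miscalibration grows roughly in proportion to the number of patients. For reference, the 5% critical value of a chi-squared distribution with 10 degrees of freedom is about 18.3, so a miscalibration that may pass unnoticed in a small cohort is flagged as highly significant in a large one.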