Washington, DC, November 9, 2009 Institute of Medicine

Holger Schünemann, MD, PhD Chair, Department of Clinical Epidemiology & Biostatistics Michael Gent Chair in Healthcare Research McMaster University, Hamilton, Canada

Disclosure Documents editor for the American Thoracic Society Member of several WHO committees Executive committee member of the ACCP antithrombotic guidelines Co-convener of a Cochrane Methods Group Member of ACP COPD Guideline Panel Co-chair of the GRADE Working Group

Content Approaches to appraising evidence and developing recommendations Canadian Task Force Oxford Centre for Evidence Based Medicine USPSTF SORT-Family Medicine Specialty Societies AHA/ACC CDC SIGN not described in detail GRADE Quality of evidence Strength of recommendation

Appraising evidence and developing recommendations To guide healthcare decision making, a guideline (panel) should weigh the desirable and undesirable consequences related to that decision for the relevant setting on the basis of the best available evidence and integrate values and preferences. Evidence = observations in the world

Quality of Evidence In the context of making recommendations: The quality of evidence reflects the extent to which our confidence in an estimate of the effect is adequate to support a particular recommendation. Evidence grading systems are frameworks to assess the degree of this confidence. Guyatt et al., 2008

Desirable and undesirable consequences desirable effects lower mortality improvement in quality of life, fewer hospitalizations reduction in the burden of treatment reduced resource expenditure undesirable consequences deleterious impact on morbidity, mortality or quality of life (including burden) increased resource expenditure

The origin of evidence appraisal systems Canadian Task Force on the Periodic Health Examination, CMAJ, 1979

Hierarchy of evidence STUDY DESIGN Randomized Controlled Trials Cohort Studies and Case Control Studies Case Reports and Case Series, Non-systematic observations Expert Opinion BIAS

Everything should be made as simple as possible but not simpler. (Albert Einstein)

Simple hierarchies are too simplistic Concealment of randomization Blinding (who is blinded in a double blinded trial?) Confounding, effect modification & ext. validity Intention to treat analysis and its correct application Why trials stopped early for benefit overestimate treatment effects? P-values and confidence intervals

Hierarchy of evidence STUDY DESIGN Randomized Controlled Trials Cohort Studies and Case Control Studies Case Reports and Case Series, Non-systematic observations Expert Opinion BIAS Expert Opinion Schünemann & Bone, 2003

Oxford Centre for Evidence Based Medicine Levels of Evidence and Grades of Recommendations - 23 November 1999
Oxford Centre for Evidence-Based Medicine (Chris Ball, Dave Sackett, Bob Phillips, Brian Haynes, and Sharon Straus).
Grade of Recommendation: A, B, C, D
Columns: Level of Evidence; Therapy/Prevention, Aetiology/Harm; Prognosis; Diagnosis; Economic analysis
Level 1a
  Therapy/Prevention, Aetiology/Harm: SR (with homogeneity) of RCTs
  Prognosis: SR (with homogeneity*) of inception cohort studies; or a CPG validated on a test set.
  Diagnosis: SR (with homogeneity*) of Level 1 diagnostic studies; or a CPG validated on a test set.
  Economic analysis: SR (with homogeneity*) of Level 1 economic studies
Level 1b
  Therapy/Prevention, Aetiology/Harm: Individual RCT (with narrow Confidence Interval)
  Prognosis: Individual inception cohort study with > 80% follow-up
  Diagnosis: Independent blind comparison of an appropriate spectrum of consecutive patients, all of whom have undergone both the diagnostic test and the reference standard.
  Economic analysis: Analysis comparing all (critically-validated) alternative outcomes against appropriate cost measurement, and including a sensitivity analysis incorporating clinically sensible variations in important variables.
Level 1c
  Therapy/Prevention, Aetiology/Harm: All or none
  Prognosis: All or none case-series
  Diagnosis: Absolute SpPins and SnNouts
  Economic analysis: Clearly as good or better, but cheaper. Clearly as bad or worse but more expensive. Clearly better or worse at the same cost.
Level 2a
  Therapy/Prevention, Aetiology/Harm: SR (with homogeneity*) of cohort studies
  Prognosis: SR (with homogeneity*) of either retrospective cohort studies or untreated control groups in RCTs.
  Diagnosis: SR (with homogeneity*) of Level >2 diagnostic studies
  Economic analysis: SR (with homogeneity*) of Level >2 economic studies
Level 2b
  Therapy/Prevention, Aetiology/Harm: Individual cohort study (including low quality RCT; e.g., <80% follow-up)
  Prognosis: Retrospective cohort study or follow-up of untreated control patients in an RCT; or CPG not validated in a test set.
  Diagnosis: Any of: Independent blind or objective comparison; Study performed in a set of non-consecutive patients, or confined to a narrow spectrum of study individuals (or both) all of whom have undergone both the diagnostic test and the reference standard; A diagnostic CPG not validated in a test set.
  Economic analysis: Analysis comparing a limited number of alternative outcomes against appropriate cost measurement, and including a sensitivity analysis incorporating clinically sensible variations in important variables.
Level 2c
  Therapy/Prevention, Aetiology/Harm: Outcomes Research
  Prognosis: Outcomes Research
Level 3a
  Therapy/Prevention, Aetiology/Harm: SR (with homogeneity*) of case-control studies
Level 3b
  Therapy/Prevention, Aetiology/Harm: Individual Case-Control Study
  Diagnosis: Independent blind comparison of an appropriate spectrum, but the reference standard was not applied to all study patients
  Economic analysis: Analysis without accurate cost measurement, but including a sensitivity analysis incorporating clinically sensible variations in important variables.
Level 4
  Therapy/Prevention, Aetiology/Harm: Case-series (and poor quality cohort and case-control studies)
  Prognosis: Case-series (and poor quality prognostic cohort studies)
  Diagnosis: Any of: Reference standard was unobjective, unblinded or not independent; Positive and negative tests were verified using separate reference standards; Study was performed in an inappropriate spectrum** of patients.
  Economic analysis: Analysis with no sensitivity analysis
Level 5
  Therapy/Prevention, Aetiology/Harm: Expert opinion without explicit critical appraisal, or based on physiology, bench research or first principles
  Prognosis: Expert opinion without explicit critical appraisal, or based on physiology, bench research or first principles
  Diagnosis: Expert opinion without explicit critical appraisal, or based on physiology, bench research or first principles
  Economic analysis: Expert opinion without explicit critical appraisal, or based on economic theory

USPSTF - Grade Definitions After May 2007: Certainty
Level of Certainty: High
  Description: The available evidence usually includes consistent results from well-designed, well-conducted studies in representative primary care populations. These studies assess the effects of the preventive service on health outcomes. This conclusion is therefore unlikely to be strongly affected by the results of future studies.
Level of Certainty: Moderate
  Description: The available evidence is sufficient to determine the effects of the preventive service on health outcomes, but confidence in the estimate is constrained by such factors as:
    The number, size, or quality of individual studies.
    Inconsistency of findings across individual studies.
    Limited generalizability of findings to routine primary care practice.
    Lack of coherence in the chain of evidence.
  As more information becomes available, the magnitude or direction of the observed effect could change, and this change may be large enough to alter the conclusion.
Level of Certainty: Low
  Description: The available evidence is insufficient to assess effects on health outcomes. Evidence is insufficient because of:
    The limited number or size of studies.
    Important flaws in study design or methods.
    Inconsistency of findings across individual studies.
    Gaps in the chain of evidence.
    Findings not generalizable to routine primary care practice.
    Lack of information on important health outcomes.
  More information may allow estimation of effects on health outcomes.
The USPSTF defines certainty as "likelihood that the USPSTF assessment of the net benefit of a preventive service is correct."

Recommendations for prognosis Use prognostic information to determine baseline risk for healthcare decisions

Centers for Disease Control and Prevention (CDC)
Evidence of Effectiveness | Execution (Good or Fair) | Design Suitability (Greatest, Moderate, or Least) | Number of Studies | Consistent | Effect Size | Expert Opinion
Strong | Good | Greatest | At Least 2 | Yes | Sufficient | Not Used
Strong | Good | Greatest or Moderate | At Least 5 | Yes | Sufficient | Not Used
Strong | Good or Fair | Greatest | At Least 5 | Yes | Sufficient | Not Used
Strong | Meet Design, Execution, Number, and Consistency Criteria for Sufficient But Not Strong Evidence (spans Execution through Consistent) | Large | Not Used
Sufficient | Good | Greatest | 1 | Not Applicable | Sufficient | Not Used
Sufficient | Good or Fair | Greatest or Moderate | At Least 3 | Yes | Sufficient | Not Used
Sufficient | Good or Fair | Greatest, Moderate, or Least | At Least 5 | Yes | Sufficient | Not Used
Expert Opinion | Varies | Varies | Varies | Varies | Sufficient | Supports a Recommendation
Insufficient | A. Insufficient Designs or Execution (spans Execution and Design Suitability) | B. Too Few Studies | C. Inconsistent | D. Small | E. Not Used

Grades of Recommendation Assessment, Development and Evaluation Aim: to develop a common, transparent and sensible system for grading the quality of evidence and the strength of recommendations - Since 2000 - Guideline developers, methodologists & clinicians from around the world CMAJ 2003, BMJ 2004, BMC 2004, BMC 2005, AJRCCM 2006, Chest 2006, BMJ 2008

GRADE Uptake
World Health Organization
Allergic Rhinitis and its Impact on Asthma (ARIA) Guidelines
American Thoracic Society
American College of Physicians
European Respiratory Society
European Society of Thoracic Surgeons
British Medical Journal
Infectious Diseases Society of America
American College of Chest Physicians
UpToDate
National Institute for Health and Clinical Excellence (NICE)
Scottish Intercollegiate Guidelines Network (SIGN)
Cochrane Collaboration
Clinical Evidence
Agency for Healthcare Research and Quality (AHRQ)
Partner of GIN
Over 40 major organizations

The GRADE approach Clear separation of 2 issues: 1) 4 categories of quality of evidence: very low, low, moderate, or high quality? methodological quality of evidence likelihood of systematic deviation from truth by outcome 2) Recommendation: 2 grades weak/conditional or strong (for or against)? Quality of evidence only one factor *www.gradeworkinggroup.org

Determinants of quality
RCTs start high; observational studies start low
5 factors that can lower quality:
1. limitations of detailed design and execution
2. inconsistency
3. indirectness
4. publication bias
5. imprecision
3 factors that can increase quality:
1. large magnitude of effect
2. all plausible confounding may be working to reduce the demonstrated effect or increase the effect if no effect was observed
3. dose-response gradient
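The rating logic on this slide can be sketched in a few lines of Python. This is an illustrative sketch only, not an official GRADE tool: the function name `rate_quality`, the factor labels, and the convention that each factor shifts the rating by a caller-supplied number of levels are assumptions made for the example. In practice each step is a judgment made per outcome, not an arithmetic rule.

```python
# Illustrative sketch only (assumed names, not an official GRADE implementation).
# Quality levels from lowest to highest, as named on the GRADE slides.
QUALITY_LEVELS = ("very low", "low", "moderate", "high")

# Factor labels are shorthand for the items listed on the slide.
LOWERING_FACTORS = frozenset({
    "limitations in design and execution",
    "inconsistency",
    "indirectness",
    "publication bias",
    "imprecision",
})
RAISING_FACTORS = frozenset({
    "large magnitude of effect",
    "plausible confounding",
    "dose-response gradient",
})

# Starting point: RCTs start high, observational studies start low.
START_INDEX = {"rct": 3, "observational": 1}


def rate_quality(design, lowered=None, raised=None):
    """Return a quality label for one outcome.

    design  -- "rct" or "observational"
    lowered -- dict mapping a lowering factor to the number of levels removed
    raised  -- dict mapping a raising factor to the number of levels added
    The number of levels per factor is an assumption supplied by the caller.
    """
    lowered = lowered or {}
    raised = raised or {}
    if design not in START_INDEX:
        raise ValueError(f"unknown study design: {design!r}")
    unknown = (set(lowered) - LOWERING_FACTORS) | (set(raised) - RAISING_FACTORS)
    if unknown:
        raise ValueError(f"unknown factors: {sorted(unknown)}")
    index = START_INDEX[design] - sum(lowered.values()) + sum(raised.values())
    # Clamp to the four available categories.
    index = max(0, min(index, len(QUALITY_LEVELS) - 1))
    return QUALITY_LEVELS[index]


if __name__ == "__main__":
    print(rate_quality("rct", lowered={"imprecision": 1, "indirectness": 1}))
    print(rate_quality("observational", raised={"large magnitude of effect": 1}))
```

The sketch makes one point of the slide explicit: study design sets only the starting level, and the final rating depends on the factors judged for each outcome.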

GRADE - 2004

Evidence Profiles/Summaries

Directness (generalizability, applicability)

Strength of recommendation The strength of a recommendation reflects the extent to which we can, across the range of patients for whom the recommendations are intended, be confident that desirable effects of a management strategy outweigh undesirable effects.

Ebell et al, 2004

USPSTF - Grade Definitions After May 2007: Recommendations
Grade A
  Definition: The USPSTF recommends the service. There is high certainty that the net benefit is substantial.
  Suggestions for Practice: Offer or provide this service.
Grade B
  Definition: The USPSTF recommends the service. There is high certainty that the net benefit is moderate or there is moderate certainty that the net benefit is moderate to substantial.
  Suggestions for Practice: Offer or provide this service.
Grade C
  Definition: The USPSTF recommends against routinely providing the service. There may be considerations that support providing the service in an individual patient. There is at least moderate certainty that the net benefit is small.
  Suggestions for Practice: Offer or provide this service only if other considerations support the offering or providing the service in an individual patient.
Grade D
  Definition: The USPSTF recommends against the service. There is moderate or high certainty that the service has no net benefit or that the harms outweigh the benefits.
  Suggestions for Practice: Discourage the use of this service.
I Statement
  Definition: The USPSTF concludes that the current evidence is insufficient to assess the balance of benefits and harms of the service. Evidence is lacking, of poor quality, or conflicting, and the balance of benefits and harms cannot be determined.
  Suggestions for Practice: Read the clinical considerations section of USPSTF Recommendation Statement. If the service is offered, patients should understand the uncertainty about the balance of benefits and harms.
* The USPSTF defines certainty as "likelihood that the USPSTF assessment of the net benefit of a preventive service is correct."
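Read as a lookup, the grade definitions above combine two inputs: the level of certainty and the magnitude of net benefit. The following Python sketch shows that mapping under stated assumptions. The function name `uspstf_grade` and the category labels are hypothetical, and treating low certainty as equivalent to "evidence is insufficient" (an I statement) is an assumption based on the certainty definitions, not a rule quoted from the USPSTF.

```python
# Illustrative sketch only (assumed names, not a USPSTF tool).
CERTAINTY = ("high", "moderate", "low")
NET_BENEFIT = ("substantial", "moderate", "small", "none or negative")


def uspstf_grade(certainty, net_benefit):
    """Map certainty and net benefit to a letter grade per the slide's definitions."""
    if certainty not in CERTAINTY:
        raise ValueError(f"unknown certainty: {certainty!r}")
    if net_benefit not in NET_BENEFIT:
        raise ValueError(f"unknown net benefit: {net_benefit!r}")
    # Assumption: low certainty means the evidence is insufficient (I statement).
    if certainty == "low":
        return "I"
    # D: moderate or high certainty of no net benefit, or harms outweigh benefits.
    if net_benefit == "none or negative":
        return "D"
    # C: at least moderate certainty that the net benefit is small.
    if net_benefit == "small":
        return "C"
    # A: high certainty that the net benefit is substantial.
    if certainty == "high" and net_benefit == "substantial":
        return "A"
    # B: high certainty of moderate benefit, or moderate certainty of
    # moderate to substantial benefit.
    return "B"


if __name__ == "__main__":
    for c in ("high", "moderate"):
        for b in NET_BENEFIT:
            print(c, b, uspstf_grade(c, b))
```

Unlike GRADE, this scheme folds certainty and magnitude of benefit into a single letter, which is one of the contrasts the talk draws between systems.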

GRADE Determinants of the strength of a recommendation - Judgments

Avian Influenza judgments about recommendation
Factors that can weaken the strength of a recommendation. Example: treatment of H5N1 patients with oseltamivir
Factor | Decision | Explanation
Lower quality evidence | Yes / No | The quality of evidence is very low
Uncertainty about the balance of benefits versus harms and burdens | Yes / No | The benefits are uncertain because several important or critical outcomes were not measured. However, the potential benefit is very large despite potentially small relative risk reductions.
Uncertainty or differences in values | Yes / No | All patients and care providers would accept treatment for H5N1 disease
Uncertainty about whether the net benefits are worth the costs | Yes / No | For treatment of sporadic patients the price is not high ($45).
Frequent yes answers will increase the likelihood of a weak recommendation

Example: Oseltamivir for Avian Flu Recommendation: In patients with confirmed or strongly suspected infection with avian influenza A (H5N1) virus, clinicians should administer oseltamivir treatment as soon as possible (strong recommendation, very low quality evidence). Values and Preferences Remarks: This recommendation places a high value on the prevention of death in an illness with a high case fatality. It places relatively low values on adverse reactions, the development of resistance and costs of treatment. Schünemann et al., The Lancet ID, 2007

Other explanations Remarks: Despite the lack of controlled treatment data for H5N1, this is a strong recommendation, in part, because there is a lack of known effective alternative pharmacological interventions at this time. The panel voted on whether this recommendation should be strong or weak, and there was one abstention and one dissenting vote (13 total).

Implications of a strong recommendation Patients: Most people in this situation would want the recommended course of action and only a small proportion would not. Clinicians: Most patients should receive the recommended course of action. Policy makers: The recommendation can be adopted as a policy in most situations.

Implications of a weak (conditional) recommendation Patients: The majority of people in this situation would want the recommended course of action, but many would not. Clinicians: Be more prepared to help patients to make a decision that is consistent with their own values (decision aids and shared decision making). Policy makers: There is a need for substantial debate and involvement of stakeholders.

Should there be no assessment? There is a very good alternative to using the system to rate clinical guidelines: clinicians and organizations should use published guidelines while considering the clinical context, the credentials, and any conflicts of interest among the authors, as well as the expertise, experience, and education of the practitioner. Kavanagh, 2009

Should there be no appraisal? Users of recommendations want to know Quality of evidence, recommendation Guideline panels have more resources, time and expertise than individual practitioner Just present the evidence?

Summary: Limitations of older systems Lack well-articulated conceptual framework Criteria not comprehensive or transparent Focus on single outcomes New systems: explicit evaluation of the importance of all important outcomes Confuse quality of evidence with strength of recommendations Transparent criteria of moving from evidence to recommendations New systems strengths: Group of international guideline developers Explicit acknowledgment of values and preferences Clear, pragmatic interpretation of strong versus conditional/weak recommendations for clinicians, patients, and policy makers Useful for SR and HTA, as well as guidelines

Conclusions Evidence appraisal systems Should provide a systematic framework General structure similar to initial approach Detailed assessment criteria better described Overlap between sophisticated systems Required because research methods complex Expert opinion evidence: interpretation Detailed guidance for developers and users needed Strength of recommendations Balance of benefits/harms, values, resource use? Can evidence be insufficient?