Russell Harris*, George F. Sawaya, Virginia A. Moyer, and Ned Calonge


Epidemiologic Reviews
© The Author 2011. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
Vol. 33, 2011
DOI: 10.1093/epirev/mxr005
Advance Access publication: June 10, 2011

Reconsidering the Criteria for Evaluating Proposed Screening Programs: Reflections From 4 Current and Former Members of the U.S. Preventive Services Task Force

Russell Harris*, George F. Sawaya, Virginia A. Moyer, and Ned Calonge

* Correspondence to Dr. Russell Harris, Cecil G. Sheps Center for Health Services Research, 725 Martin Luther King, Campus Box 7590, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7590 (e-mail: russell_harris@med.unc.edu).

Accepted for publication February 24, 2011.

In 1968, Wilson and Jungner published 10 principles for evaluating screening programs (Public Health Papers No. 34. Geneva, Switzerland: World Health Organization), criteria widely used since then. The 4 authors of this review (all current or former members of the U.S. Preventive Services Task Force) have found a different paradigm more useful for evaluating screening programs. This review was written independently of the USPSTF; the authors speak only for themselves and not for the USPSTF. They suggest evaluating screening programs not as a checklist but as a balance between the magnitude of benefits and the magnitude of harms, each estimated from a systematic review of the evidence. To emphasize a focus on health outcomes, the authors suggest reframing the target of screening as an umbrella concept: the predictor of poor health. Evaluation groups should weigh health benefits and harms to estimate net benefits and then consider whether these net benefits justify the resources required. The final decision about implementation should be made by a democratic process that considers both the panel's evaluation of the evidence and nonevidence factors (e.g., resources available, other priorities, the informed population's preferences). The authors hope these suggestions stimulate further discussion about the optimal way to evaluate proposed screening programs.

evaluation studies as topic; evidence-based practice; health planning guidelines; mass screening; outcome assessment (health care); quality of health care; review; risk assessment

Abbreviations: NNNb, number needed to screen for benefit; NNNh, number needed to screen for harm; PPH, predictor of poor health; QALY, quality-adjusted life-year; USPSTF, U.S. Preventive Services Task Force.

The idea of screening is at least 150 years old; its popularity has grown over time (1, 2). Targeted screening for tuberculosis and syphilis started early in the 20th century. Medical leaders promoted the idea of the periodic health examination (largely an exercise in screening) beginning in the 1920s (1). Screening for multiple conditions increased after World War II, with the finding that many young recruits had conditions that disqualified them from military service. In the 1950s, multiphasic screening programs increased, with gradual public and medical acceptance (2).

Initially, the increased interest in screening was driven by advances in testing, apparently without much consideration of the health outcomes. Over the years, many observers have commented on the seductive appeal of early detection, and more and more screening tests and programs have been proposed. Many have been implemented.

In the 1960s and 1970s, some groups began to question the wisdom of widespread, unexamined screening. The science of clinical epidemiology was beginning, and evidence-based groups began to ask difficult questions. As an expression of this more critical approach, some proposed criteria by which to evaluate screening programs.
In 1961, the US Public Health Service (3) published a monograph, Principles and Procedures in the Evaluation of Screening for Disease, and, in 1968, the World Health Organization published a monograph by Wilson and Jungner (4), The Principles and Practice of Screening for Disease. The Wilson and Jungner monograph proposed 10 principles for evaluation of screening programs (Table 1). These principles have been very influential ever since, as has been recently shown by a systematic review of screening criteria proposed since publication of the Wilson and Jungner report (5). The authors of this review, Andermann et al. (5), suggested refinements to the Wilson and Jungner system, considering especially their relevance to genetic screening.

Table 1. Summary of Wilson and Jungner Criteria (4)

Principle: The condition sought should be an important health problem.
Further Explanation by Wilson and Jungner: Does not depend on prevalence only; must consider from the point of view of the individual and community; conditions with serious consequences for either individuals or the community may both justify screening.

Principle: There should be an accepted treatment for patients with recognized disease.
Further Explanation by Wilson and Jungner: Perhaps most important criterion; unless there is an effective treatment, actual harm may be done; requires answering 2 questions: 1) Does treatment at the presymptomatic borderline stage of a disease affect its course and prognosis? 2) Does treatment of the developed clinical condition at an earlier stage than normal affect its course and prognosis? If the answer to question 1 is not clearly yes, then there is no case for screening. For question 2, effective treatment is usually assumed.

Principle: Facilities for diagnosis and treatment should be available.
Further Explanation by Wilson and Jungner: Must have facilities available for the diagnosis and treatment of people found positive by screening.

Principle: There should be a recognizable latent or early symptomatic stage.
Further Explanation by Wilson and Jungner: Must be a reasonable asymptomatic period in the natural history of the condition.

Principle: There should be a suitable test or examination.
Further Explanation by Wilson and Jungner: Test must be easy and quick, may be less sensitive and specific than a diagnostic test. In a screening test, one may accept a higher false-positive rate, but a high false-negative rate would not be acceptable.

Principle: The test should be acceptable to the population.
Further Explanation by Wilson and Jungner: Acceptability is related to the nature of the risk involved and the extent to which the ground is prepared previously by health education.

Principle: The natural history of the condition, including development from latent to declared disease, should be adequately understood.
Further Explanation by Wilson and Jungner: It is necessary to have conducted enough research to know 1) What changes should be regarded as pathologic and what should be considered physiologic variations? and 2) Are early pathologic changes progressive? It is necessary to know, Is there an effective treatment that can be shown either to halt or to reverse the early pathologic changes? We may not know the answer to this question because randomized controlled trials of screening or treatment have not been conducted. We must be careful to heed the Hippocratic principle of primum non nocere.

Principle: There should be an agreed policy on whom to treat as patients.
Further Explanation by Wilson and Jungner: There is a borderline problem whereby people are found by screening who are neither clearly normal nor abnormal. It is important to have a clear policy for either treatment or follow-up of these people.

Principle: The cost of case finding (including diagnosis) should be economically balanced in relation to possible expenditure on medical care as a whole.
Further Explanation by Wilson and Jungner: There are 2 general aims of screening: to improve health and to reduce costs. It is not certain that screening will reduce costs; there is a need for randomized controlled trials of screening to determine this, although these trials are difficult to conduct.

Principle: Case finding should be a continuing process and not a once and for all project.
Further Explanation by Wilson and Jungner: The benefit of single-occasion screening is limited.

In North America, organized and systematic examination of the evidence for screening occurred initially in the 1970s. Especially influential were individuals and groups such as Frame and Carlson (6), Friedman et al. (7), the Canadian Task Force on the Periodic Health Examination (8), and, a few years later, the U.S. Preventive Services Task Force (USPSTF) (9). Since 1984, the USPSTF has evaluated multiple prevention services, making multiple recommendations about whether to implement screening programs (10). The USPSTF has also had lengthy discussions about evaluation of screening programs and has published several articles on its methods (11-14).

We 4 authors of this review (all have had extensive experience on the USPSTF) have been deeply involved in these discussions, and we have noted that neither the Wilson and Jungner criteria (4) nor the Andermann et al. revisions (5) fully capture the evaluation methods we have found most useful in our USPSTF experience. We therefore reviewed our experience with screening recommendations from the USPSTF (extending from 1997 to 2011), supplemented by the Andermann et al. review of proposed criteria since Wilson and Jungner's monograph and by an updated search for other criteria published since Andermann et al.'s review. Using these sources, we suggest a revised approach to evaluation of proposed screening programs. We would like to envision our revised approach as the stimulus to wider discussion of these issues, hopefully culminating in an agreed-on revision of the Wilson and Jungner approach.

Although all 4 of us are current or past members of the USPSTF, this review was written independently of the USPSTF; we do not speak for the USPSTF or the Agency for Healthcare Research and Quality (which convenes the USPSTF). We have been influenced by our work on the USPSTF, but the views and proposals herein are our own.

MATERIALS AND METHODS

First, we define our topic more precisely by noting that 2 overlapping issues are involved in considering screening programs:

1. If implemented under present conditions, would the screening program result in sufficient net benefit for the population to justify starting (or continuing) the program, given the level of resources required?
2. If the answer to the first question is yes, what needs to be done to optimize implementation of the program?

The USPSTF focuses primarily on the first of these issues, as do most of the Wilson and Jungner criteria (4). The Andermann et al. review (5) modified the Wilson and Jungner criteria list, at least partly, by adding considerations that focus more on the second question. Our focus is on the first question, although we also recognize the importance of the implementation issue.

Second, we take as given that any screening program must first define its purpose in terms of the population it intends to screen and what health outcomes it seeks to improve. Our paper deals with analyzing potential (or ongoing) screening situations that have defined their purposes to at least this degree.

We sought to identify potential approaches to analyzing the issue of whether a screening program should be implemented. We accepted 3 types of evidence for this review:

1. the original Wilson and Jungner criteria and monograph (4);
2. the suggested criteria from the systematic review by Andermann et al. (5) (which ended its search in 2004) and from an additional supplementary search to identify relevant criteria suggested in the literature since 2004; and
3. the extensive USPSTF experience of all 4 of us authors.

We reviewed the original Wilson and Jungner monograph in detail, and we contacted Professor Andermann to obtain the specific search terms and results from her work. We then modified these terms to search for papers published between 2004 and July 27, 2010, that provided suggested criteria for the first question above that had not been previously identified by Andermann et al. Our search strategy and results are shown in Appendix Table 1. Our modifications to the Andermann et al. search primarily broadened the search beyond restrictions on genetic testing. The Andermann et al. review was most interested in genetic screening, but it used search terms broad enough to capture suggested criteria from nongenetic groups as well.
All papers from our search were reviewed by the first author (R. H.), who discussed potential additional criteria with the other 3 authors. Finally, we reviewed previous relevant USPSTF methods papers and the USPSTF Procedure Manual (11-14), as well as the documents from all current USPSTF screening topics, many of which had been updated from previous reviews conducted by the USPSTF over its 25-year history (Web Appendix 1, which is posted on the Epidemiologic Reviews Web site: www.epirev.oxfordjournals.org). When necessary, we reviewed older USPSTF documents as well as the most current ones.

Using these 3 data sources, we synthesized the various criteria and approaches into a suggested approach by a series of e-mails and telephone calls. All authors worked on the paper, including approving the final version. There was no external funding for this project.

RESULTS: DATA SOURCES

Wilson and Jungner criteria (summary, Table 1)

The Wilson and Jungner monograph (4) was written at a time very different from the present. Commissioned by the World Health Organization, its primary purpose was to stimulate the field of screening, as explained in its preface:

"In developed countries, therefore, it would seem that the practice of screening for disease should be widespread. That it is not so to the extent that might be expected is due to a number of factors, among them the cost of screening and the tendency in the medical profession to wait for patients rather than actively to look for disease in the population. Another factor undoubtedly is inadequate knowledge of the principles and practice of screening for disease" (4, p. 7).

The monograph offers an in-depth treatment of the field of screening at the time (4). Among other issues, it sets out in summary form 10 principles that could help people recognize opportunities for screening.
After the summary, however, Wilson and Jungner discuss each principle in more detail, demonstrating their understanding that there are harms as well as benefits to screening (Table 1). They did not, however, conceive of the evaluation of screening programs as a balance involving weighing the magnitude of health benefits and health harms. We agree considerably with the Wilson and Jungner principles, yet we conclude that they are not framed in an optimal way to ensure a careful and complete evaluation of screening programs today.

Andermann et al. literature search and additional search by the authors

We reviewed the 53 sets of criteria found in the Andermann et al. (5) search and the revisions suggested by these authors. As they note, the majority of these criteria overlap the classic Wilson and Jungner criteria (5, p. 318). Some of the revisions suggested by Andermann et al. reflect an interest in question 2 above: ways to optimize implementation of the screening program. Our focus, however, was on question 1: estimating the magnitude of net benefit and making a recommendation about implementation. Our supplementary search found 517 additional citations (Appendix Table 1). However, our review found none of these proposed previously unconsidered criteria relevant to question 1.

Current USPSTF screening recommendations

We spent time considering the evidence and approach to recommendations we encountered in our experience on the USPSTF and the several relevant methodological papers that have been written by the USPSTF (10-14). The current screening recommendations of the USPSTF are given in Web Appendix 1.

RESULTS: SYNTHESIS

Focusing on health outcomes and the screening balance: assessing benefits and harms (summary, Table 2)

Our experience with multiple screening topics has taught us to focus on health outcomes rather than diseases or intermediate outcomes. The purpose of screening is to improve the length and/or quality of people's lives, not just to find abnormalities. In this framework, we are not screening for diseases but instead risk factors for adverse health outcomes. We find "disease" a vague and unhelpful term. The term "risk factor," however, also has its various connotations and is used in several different ways in different contexts. To more precisely specify the appropriate target of screening, and to emphasize the need to focus on health outcomes, we suggest the term "predictor of poor health" (PPH). This inclusive term includes abnormal screening test results as well as abnormalities found after workup of an abnormal screening test. Thus, diseases (e.g., asymptomatic invasive breast cancer or asymptomatic diabetes), prediseases (e.g., ductal carcinoma in situ or prediabetes), and risk factors (e.g., dyslipidemia) are all PPHs.

Table 2. Summary of Considerations for Estimating Benefits and Harms of a Screening Program

Magnitude of Potential Benefits

Consideration: Probability of an adverse health outcome without screening
Comments: It is important to define and focus on the adverse health outcome that the screening program is attempting to reduce. One must also define the specific population that the program intends to screen.

Consideration: Degree to which screening identifies all people who would suffer the adverse health outcome
Comments: The proper target of screening is to detect those people who will suffer the adverse health outcome. Detecting (labeling) people who will not suffer the adverse health outcome is not a benefit.

Consideration: Magnitude of incremental health benefit of earlier versus later treatment resulting from screening
Comments: For a screening program to reduce morbidity or mortality from an adverse health outcome, earlier treatment (after screening detection) must provide more health benefit than later treatment (after clinical detection). It is important to estimate this incremental benefit in absolute terms.

Magnitude of Potential Harms

Consideration: Frequency of false-positive screening tests
Comments: The best estimate of the frequency of false-positive screening tests is the cumulative percentage of people screened who have at least one false-positive screening test over a period of time, such as 10 years.

Consideration: Experience of people with false-positive results
Comments: A negative experience of people with a false-positive screening test may come from either physical (e.g., risk of complications from a colonoscopy after a positive fecal occult blood test) or psychological (e.g., short-term or long-term anxiety after a false-positive test) causes. Small or infrequent negative experiences, when experienced by many people, may add up to large harms for a population.

Consideration: Frequency of overdiagnosis
Comments: The critical issue in defining overdiagnosis is whether earlier diagnosis (due to screening) compared with later diagnosis (due to clinical detection) leads to increased labeling, diagnostic evaluation, or treatment that has potential adverse effects on health.

Consideration: Experience of people who are overdiagnosed
Comments: It is important to estimate, for the screened population, the absolute frequency and severity of the adverse effect on health due to increased labeling, diagnostic evaluation, or treatment resulting from earlier diagnosis.

Consideration: Frequency and severity of harms of workup and treatment
Comments: Helps determine the harms of overdiagnosis (2b)^a. For people found to have conditions that would lead to the adverse health outcome, harms from earlier workup or treatment means they will suffer these harms for a longer time, thus reducing net benefits.

^a Refer to the text for more information about consideration 2b.

Use of the umbrella term PPH enables us to consider the strength of various proposed screening targets. By strength, we mean the degree to which the proposed screening target predicts adverse health outcomes. PPHs should be considered along a continuum of absolute probability for an adverse health outcome, but we can think of them generally as weak to moderate to strong predictors. For example, some prostate cancers qualify as weak PPHs; that is, they are associated with a low absolute probability of adverse health outcomes due to prostate cancer. Weak PPHs have little value as screening targets; some weak PPHs may indicate such a low probability that no further screening is needed. On the other hand, some mammographic findings are stronger PPHs in that they are associated with a higher absolute probability of an adverse health outcome due to breast cancer. When unrecognized, some strong PPHs are actually adverse health outcomes in themselves, as with visual or hearing impairment, or depression. Depending on whether detection leads to more effective treatment (see below), strong PPHs may be excellent screening targets.

The strength of a PPH depends on its ability to predict an adverse health outcome. This prediction may occur in 1 of 2 ways. The first is as a statistical predictor without being a causal intermediate. For example, a woman's score on the Gail et al. (15) model (a PPH found by a series of screening questions) gives a probability of developing breast cancer within the coming 5 years. One of the questions that contributes to the Gail et al. model score is whether the woman has ever had a breast biopsy. Although having had a breast biopsy does not actually cause breast cancer, it is statistically associated with being diagnosed with breast cancer.

Unfortunately, even a high Gail et al. model score is a weak PPH in that it predicts only a marginally higher absolute risk than a lower score does.

The second way a PPH may predict an adverse health outcome is by being a part of the chain of events that progresses to the adverse health outcome: an intermediate PPH. For example, asymptomatic invasive breast cancer sometimes progresses to symptomatic breast cancer, which often progresses further to severe adverse health states or even death. The strength of intermediate PPHs is determined by their rate of progression to the adverse health outcome. The rate of progression can be difficult to determine; many intermediate PPHs are heterogeneous, with variation in progression. That is, some members of an intermediate PPH class may progress rapidly, others may progress slowly, and still others may not progress at all, or even regress. For example, some people with impaired glucose tolerance (an intermediate PPH) never progress to symptomatic diabetes, some progress only after several years, and some progress more rapidly (16). As noted earlier, when the progression rate is uncertain, as in the case with ductal carcinoma in situ of the breast (17), the strength of the PPH is also uncertain.

In many cases, we do not have a strong PPH; we have yet to identify a PPH for ovarian cancer that is strong enough to merit screening (18). In other cases, we have confused strong PPHs with similar, but much weaker PPHs. For example, an abdominal aortic aneurysm with a diameter of 3.0-3.9 cm is a weak PPH, whereas one 5.0 cm or larger is a strong PPH (19). Yet, we have used the term "abdominal aortic aneurysm" to categorize both. Similarly, colonic polyps 1.0 cm or larger are stronger PPHs than polyps of less than 0.5 cm (20). However, the usual practice is to remove all polyps, regardless of size.
For some abnormalities detected by screening, the degree to which they predict an adverse health outcome is uncertain: they are uncertain PPHs. For example, the strength of ductal carcinoma in situ or prediabetes as a PPH is uncertain because of our uncertainty about the degree to which they predict adverse health outcomes from breast cancer or diabetes, respectively. Until we can determine their strength, uncertain PPHs make poor screening targets. Treating weak or uncertain PPHs the same way we treat strong PPHs may cause more harm than benefit.

Risk stratification tools may be thought of as screening tests, and the risk categories into which people are classified are thus PPHs. Sometimes, the PPH categories are called "high" when, in absolute terms, they are low. Thus, the Gail et al. (15) model labels a woman with a 5-year probability of breast cancer of 1.67% as being "high risk"; in absolute terms, this is a weak PPH. To detect a strong PPH, the stratification tool or screening test must discriminate subgroups with a high absolute probability of developing an adverse health outcome.

Ideally, a screening test would be able to accurately assign subgroups to one end of the probability continuum or the other, that is, attain complete separation between subgroups that will or will not suffer the adverse health outcome. In this ideal scenario, people with a high absolute probability would be advised to undergo further testing, while people at the lower end of the continuum could stop testing and be reassured. This is essentially the aspiration of those who advocate "personalized medicine" (21). Unfortunately, few if any screening tests or stratification tools have yet attained this degree of separation between subgroups.

Throughout this review, we use the term PPH as the target of screening. We summarize some aspects of PPHs in Table 3.
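The distinction between a relative label such as "high risk" and the absolute probability behind it can be made concrete with a small calculation. The following Python sketch is illustrative only: the 1.67% figure is the 5-year probability cited above, whereas the comparison risk of 1.00% and the function name `absolute_vs_relative` are assumptions introduced here for the example and do not come from the Gail et al. model or from this review.

```python
def absolute_vs_relative(risk_in_group: float, comparison_risk: float) -> dict:
    """Contrast relative and absolute views of the same risk category."""
    return {
        # How many times higher the group's risk is than the comparison risk.
        "relative_risk": risk_in_group / comparison_risk,
        # Difference in absolute probability between the two groups.
        "absolute_difference": risk_in_group - comparison_risk,
        # Expected number of people per 1,000 who develop the outcome.
        "expected_cases_per_1000": 1000 * risk_in_group,
        # Expected number of people per 1,000 who do not develop the outcome.
        "expected_unaffected_per_1000": 1000 * (1 - risk_in_group),
    }


gail_high_risk = 0.0167           # 5-year probability cited in the text
hypothetical_comparison = 0.0100  # assumed comparison risk, for illustration only

result = absolute_vs_relative(gail_high_risk, hypothetical_comparison)
for name, value in result.items():
    print(f"{name}: {value:.4g}")
```

Under these assumed inputs, a risk that looks substantially elevated in relative terms still corresponds to fewer than 20 expected cases per 1,000 women over 5 years, which is the sense in which the category may be a weak PPH in absolute terms.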
After an abnormal screening test is detected (itself a PPH), we usually go through a workup that may or may not detect a further, potentially stronger PPH. If one is detected, we may modify or treat this PPH to reduce the likelihood of adverse health events. If earlier treatment (at screening diagnosis) reduces the frequency or severity of adverse health outcomes more than later treatment (at clinical diagnosis) would, then the difference in health produced by earlier compared with later treatment is the magnitude of health benefit from the screening program. Unfortunately, every screening program also has the potential to do harm. Thus, in addition to examining evidence to estimate the magnitude of potential benefit from the program, one must also examine the evidence to estimate the magnitude of harm that may be produced by the program. Weighing the balance of this trade-off between benefits and harms (both measured in health terms) is a central activity in evaluating screening programs. We depict this as a balance between benefits and harms, with a further determination about whether the magnitude of net benefits justifies the resources required by the program (Figure 1). As shown in Figure 1, the magnitude of health benefits is weighed against the magnitude of health harms. Evaluators then determine whether the magnitude of net benefit (i.e., the extent to which benefits do or do not outweigh harms) is worth the resources required by the program. Implementation should be considered for screening programs that provide net benefits of reasonable magnitude at a reasonable resource input, although other factors such as public priorities, availability of needed resources, and competing programs may delay or prevent actual implementation (Figure 1). Ultimately, compared with the checklist approach suggested by Wilson and Jungner (4), this screening balance approach has been a more useful guide for us in evaluating proposed screening programs. 
Although the checklist has served some as a quick way to justify starting or extending screening, the complexity of screening requires that we examine the interplay of the factors involved in benefits, harms, and resource use. We readily admit that this formulation does not provide a quantitative way to assess screening programs. Judgment about the certainty of evidence and about the magnitude and trade-offs of benefits and harms is still required. We do suggest, however, that the screening balance, if implemented in an unbiased and transparent manner, does provide a standard approach that could well be useful to recommendation groups or individuals critically appraising these recommendations. As shown in Figure 1, evaluation of a proposed screening program depends on evidence for 3 questions, plus the preferences of a properly informed population to be screened. The 3 questions involve 1) the magnitude of health benefits, 2) the magnitude of health harms, and 3) the resources required to implement and administer the program.

Evaluating Proposed Screening Programs 25

Table 3. Aspects of the PPH

Definition of PPH: The target of a screening test, to be considered along a continuum of the probability it confers for the adverse health outcome of interest. PPH is an inclusive term that includes
• Abnormal screening tests
• Preconditions
• Diseases
• Risk factors

Strength of a PPH: The degree to which it predicts the adverse health outcome within a defined period of time. Strength is continuous but may be considered in the following general categories:
• Strong: represents a high absolute probability of the adverse health outcome within a given time period
• Moderate: represents a moderate absolute probability of the adverse health outcome within a given time period
• Weak: represents a low absolute probability of the adverse health outcome within a given time period
• Uncertain: the probability of people with this PPH experiencing the adverse health outcome is uncertain

Importance of the PPH idea
• It moves away from categorical thinking, which is vague and may be influenced by changing definitions of disease or risk factors.
• It emphasizes what we know and do not know about the probability of the adverse health outcome after a positive screening test and/or workup.
• It asks us to be clear about what we do not know about the entities detected by screening, allowing for uncertain PPHs.
• It allows for heterogeneous PPHs, yet also allows for changing a PPH from uncertain to another general category, depending on future research.
• It helps guide evaluation because weak and uncertain PPHs are generally poor targets for screening.
• Strong PPHs do not determine the balance between benefits and harms of a screening program, but strong PPHs do reduce the harms associated with false positives and overdiagnosis.
• Both strong and moderate PPHs may be good targets for screening programs, but it depends on other factors such as the burden of suffering of the population from the adverse health outcome before screening, the incremental benefit of early versus later treatment, and the balance between benefits and harms overall.

Abbreviation: PPH, predictor of poor health.

Figure 1. The balance approach to evaluating proposed screening programs. The magnitude of benefits is weighed against the magnitude of harms. Note: Resource use is considered after net benefits are established.

Magnitude of health benefits (summary, Table 2). The most important factors on the benefit side of the screening balance are 1) the probability of the adverse health outcome (e.g., death from breast cancer or myocardial infarction attributable to diabetes) in the population with no screening; 2) the degree to which the PPH identifies all people who would suffer the adverse health outcome; and 3) the incremental benefit of earlier (at screening detection) versus later (at clinical detection) treatment resulting from detection of the PPH (e.g., asymptomatic breast cancer or diabetes) in reducing the adverse health outcomes in factor 1.

The probability of the adverse health outcome without screening. A screening program should begin with the goal of reducing an adverse health outcome in a specific population. The benefit of screening for the population cannot exceed the burden of suffering caused by the potentially preventable adverse health outcome. For example, although the goal of screening for carotid artery stenosis is to reduce the incidence of stroke, only about 10% of strokes are attributable to carotid artery stenosis (22). Thus, the maximum possible benefit of screening would be to reduce stroke incidence by 10%.
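The ceiling described above can be illustrated with a short calculation. The following is only a sketch under stated assumptions: it treats the three benefit factors listed above as fractions that multiply, which is a simplification introduced here and not a model given in the text. The function name and its parameters are hypothetical; the 10% figure is the carotid artery stenosis example from the text, and the 0.5 detection fraction is an arbitrary illustrative value.

```python
def max_relative_reduction(attributable_fraction,
                           detected_fraction=1.0,
                           incremental_effect=1.0):
    """Upper bound on the relative reduction in an adverse health outcome.

    ASSUMPTION (not from the source): the three factors act as
    independent fractions that can simply be multiplied.

    attributable_fraction: share of the outcome attributable to the PPH
    detected_fraction: share of people who would suffer the outcome
        whom screening actually identifies
    incremental_effect: relative benefit of earlier versus later treatment
    """
    for name, value in (("attributable_fraction", attributable_fraction),
                        ("detected_fraction", detected_fraction),
                        ("incremental_effect", incremental_effect)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return attributable_fraction * detected_fraction * incremental_effect


if __name__ == "__main__":
    # Carotid artery stenosis example: about 10% of strokes attributable,
    # with perfect detection and perfectly effective early treatment.
    print(max_relative_reduction(0.10))
    # Same example with a hypothetical 50% detection fraction.
    print(max_relative_reduction(0.10, detected_fraction=0.5))
```

With perfect detection and perfectly effective early treatment, the bound equals the attributable fraction itself; any shortfall in either of the other two factors can only lower it.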
The degree to which screening identifies all people who would suffer the adverse health outcome. This factor is similar to, but not the same as, the commonly used sensitivity of a screening test. Whereas the usual sensitivity term is measured by assessing the degree to which the screening test leads to identifying everyone with disease or predisease, our concept of sensitivity assesses the degree to which screening detects that subgroup of people who would suffer the adverse health outcome. These are the people who could possibly be helped by earlier detection. The benefit of screening cannot exceed the extent to which it finds PPHs that lead to detection of all those who would suffer the adverse health outcome. If screening finds only 50% of the people who would suffer the adverse health outcome, then its benefit for the population cannot exceed a 50% reduction in the outcome, even if early treatment were completely effective. Detecting (labeling) other people beyond those who would suffer the adverse health outcome can produce only harm (see below).

Incremental benefit of earlier versus later treatment. The third important factor that determines the benefit side of the screening balance is the incremental benefit of earlier (after screening detection) versus later (after clinical detection) treatment resulting from detection of the PPH. The importance of the incremental benefit of earlier versus later treatment has not been emphasized adequately in previous screening criteria, yet the benefit side of the screening equation is driven by this incremental effect. If there is no effect of treatment regardless of when the PPH is recognized, then screening cannot be justified. If there is extremely effective treatment at later, clinical detection, then it may be difficult to show an incremental benefit of earlier treatment. For this reason, current effective screening programs will need to be reevaluated if treatment becomes more effective. For example, the benefits of breast cancer screening may be less today than when screening was first introduced, partly because of improved breast cancer treatment (23, 24). If the PPH is strong and the incremental effectiveness of earlier versus later treatment is large, the problems of a low PPH prevalence may be overcome. Such is the case with screening for phenylketonuria (25).
In this situation, a positive screening test is highly predictive of an infant developing phenylketonuria, an uncommon but potentially devastating disease amenable to early treatment. Some other proposed screening programs for newborns are more complex; adequately assessing the benefits and harms in these situations can be challenging (26). Many hope that, in the future, genetic testing will be able to detect extremely strong PPHs, thus justifying screening general populations. To date, however, even the best genetic testing situations have identified only a small proportion of the people who will suffer the adverse health outcome, and have thus offered little benefit to society (27). Some argue that knowledge alone is a benefit of screening. Examples of screening programs based on this idea may include screening older adults to detect strong PPHs for progressive dementia (28) or screening newborns for strong PPHs for severe developmental abnormalities (26). In some cases (e.g., progressive dementia), neither strong PPHs nor effective treatments exist today. When there are strong PPHs but no effective treatment exists, screening could theoretically be justified to facilitate future planning or to reduce the likelihood of a diagnostic odyssey, in which early symptoms of health problems provoke extensive diagnostic evaluation over a prolonged period of time before a diagnosis is eventually reached. Because these diagnostic misadventures lead to high anxiety or complications, providing accurate information may theoretically reduce these harms, even though effective treatment is not available. However, it is important to closely examine claims of benefits from knowledge alone in at least 2 ways. The first is by making sure that the PPH that is the target of screening is strong, that is, that it is a strong predictor of an adverse health outcome.
Screening for weak or uncertain PPHs could well cause more harm than good because the program may be providing more misinformation than information. Most tests, for example, have less than 100% specificity, thus falsely labeling many people (especially in low prevalence populations) as possibly having a feared condition when in fact they do not. The screening could provoke a diagnostic odyssey rather than prevent one. This harm (see below) could potentially cause more anxiety than detecting the PPH could relieve. The second way that claims of benefits from knowledge should be examined lies in the documentation of reduced anxiety or diagnostic misadventures from screening. Systematic studies rather than anecdotes or theoretical situations are needed to determine the magnitude of claimed benefits (as well as actual harms) from screening. Magnitude of health harms (summary, Table 2). There are 3 important factors on the harms side of the screening balance, 2 of which have 2 parts. The first 2 are 1a) the frequency of false-positive screening tests (determined by the prevalence of the PPH sought in the screened population plus the specificity of the testing strategy) and 1b) the experience of people with false-positive tests; and 2a) the frequency of overdiagnosis (defined below) and 2b) the experience of overdiagnosed people. By experience, we mean the value of the health lost due to the false-positive or overdiagnosed health state (e.g., anxiety and complications of labeling, diagnostic evaluation, and increased surveillance for people with false-positive tests; or anxiety and complications from the diagnostic evaluation, increased surveillance, and unneeded treatment for overdiagnosis) integrated with the duration of the harm. Factor 3 on the harms side of the screening balance is the magnitude (e.g., frequency and severity) of harms inherent in the testing, workup, and treatment of the condition. 1a) Frequency of false-positive screening tests. 
Harms of false-positive screening tests are determined by their frequency, their immediate effect on the person's health (often in terms of anxiety), and the duration of this effect. Because screening tests are usually conducted periodically, the most relevant measure of frequency is the cumulative percentage of screened people having at least 1 false-positive test over some extended period of time (e.g., 10 years), including several screening tests. This percentage is usually considerably greater than the percentage of people who have a false-positive test on a single screen. The frequency of false-positive screening tests is determined by both the prevalence of the PPH in the screened population and the specificity of the screening strategy. As screening targets PPHs with lower and lower prevalence, the frequency of false-positive test results increases. The specificity of the screening strategy depends on the threshold (cutpoint) used by the screening strategy to define a positive test. Lower thresholds define more screening tests as positive, and more screened people must then undergo further diagnostic evaluation. The specificity of screening strategies sometimes varies among testing sites (e.g., the specificity for mammography screening is lower in the United States than in many sites in Europe) (29), likely because of the use of different thresholds. In most screening situations, false-positive tests are several times more frequent than true-positive tests. Interestingly, specificities that function well in diagnostic testing situations may still be too low for minimizing harms from false positives in screening situations. For example, a screening test that is 96% specific falsely classifies 4% of all screened people without the PPH as positive. Because the great majority of people in screening situations do not have the PPH, there will be many more false-positive tests than true-positive tests, resulting in many people being subjected to the harms of false-positive tests.

1b) Experience of people with a false-positive screening test. Although some commentators have minimized the importance of the anxiety effect of false-positive screening tests, systematic reviews show that these effects are real and important to consider (30). Most people with a positive screening test have short-term anxiety until the diagnostic testing has been completed. A smaller, but significant percentage of people suffer anxiety for an extended period of time. Interestingly, most screening studies show that a considerable number of people never undergo diagnostic testing after a positive screening test. The experience of these people should be investigated further. The percentage of people who have at least one false-positive screening test over a time period is often quite high compared with the percentage who have a true-positive test. Thus, even small effects for many people can add up to considerable harm for a screened population. False-positive tests can also cause harm through increased surveillance.
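The arithmetic behind the frequency of false-positive tests can be sketched as follows. This is an illustration under stated assumptions, not a calculation taken from the text: only the 96% specificity comes from the example above, while the population size, the 1% prevalence, the 90% sensitivity, and the 10 screening rounds are hypothetical values chosen for illustration. The cumulative calculation also assumes that results of successive screens are independent, which real screening programs may not satisfy. Both function names are hypothetical.

```python
def screening_counts(population, prevalence, sensitivity, specificity):
    """Expected true-positive and false-positive counts for one screen."""
    with_pph = population * prevalence
    without_pph = population - with_pph
    true_positives = with_pph * sensitivity
    false_positives = without_pph * (1.0 - specificity)
    return true_positives, false_positives


def cumulative_false_positive_risk(specificity, rounds):
    """Probability of at least one false-positive test over repeated screens
    for a person without the PPH.

    ASSUMPTION: each round is independent with the same specificity.
    """
    return 1.0 - specificity ** rounds


if __name__ == "__main__":
    # Hypothetical inputs except for the 96% specificity from the text.
    tp, fp = screening_counts(population=10_000, prevalence=0.01,
                              sensitivity=0.90, specificity=0.96)
    print(f"true positives: {tp:.0f}, false positives: {fp:.0f}")
    risk = cumulative_false_positive_risk(specificity=0.96, rounds=10)
    print(f"risk of at least 1 false positive in 10 screens: {risk:.1%}")
```

With these illustrative inputs, a single screen yields about 90 true-positive and about 396 false-positive results, and roughly one-third of people without the PPH would have at least 1 false-positive test over 10 screens. Lowering the assumed prevalence widens the gap between false-positive and true-positive counts.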
For example, women with abnormal cervical cytology but no neoplasia on colposcopy, and women with abnormal mammography and no malignancy on breast biopsy, are frequently asked to return for more frequent testing. If subsequent tests continue to be minimally abnormal, surveillance may continue for many years. This situation creates a longer-term label from false-positive screening tests that can potentially lead to ongoing health anxiety and to repeated exposure to harm from diagnostic procedures. 2a) Frequency of overdiagnosis. Overdiagnosis in screening has been appreciated more and more as an important harm to consider in evaluating screening programs (31). Overdiagnosis, or unnecessary diagnosis, concerns the experience of people who have a true-positive screening test as opposed to false-positive tests. Although several definitions for overdiagnosis have been used, we suggest that the critical issue is whether earlier detection (at screening) compared with later detection (at clinical symptoms or signs) leads to increased (or earlier) labeling, diagnostic evaluation, or treatment that has potential adverse effects on health. When there would have been no detection without screening, then any adverse health effects following detection can be attributed to overdiagnosis. 
This is a somewhat broader definition than is sometimes used, and it encompasses at least 3 clinical entities: 1) diagnosis of PPHs considered disease (e.g., asymptomatic early prostate cancer, diabetes, or coronary atherosclerosis) that would never have become clinically apparent in the patient's lifetime; 2) diagnosis of PPHs considered preconditions (e.g., colonic polyps, cervical intraepithelial neoplasia 1 or 2, impaired fasting glucose, mild cognitive impairment, ductal carcinoma in situ of the breast, 3.0–3.9-cm abdominal aortic aneurysms) that would never progress to clinical conditions, or may even regress; and 3) diagnosis of nonfatal PPHs that would progress to mild clinical symptoms (e.g., a small breast lump) but not to significant clinical problems. Overdiagnosis, then, is finding weak PPHs that do not predict adverse health states and do not need to be found or treated to protect the health of the screened person. One of the driving factors leading to overdiagnosis is the heterogeneity of conditions known as disease. Diseases with the same name often have markedly different health effects; some act in a malignant manner, leading to rapid demise, while others run a more benign course, with little long-term effect on health. One advantage of the PPH terminology is that the degree to which the PPH predicts adverse health outcomes (if known) can be specified; the PPH may vary in strength from strong to weak, or may be uncertain. The frequency of overdiagnosis has been contested. When the above definition is used, however, it becomes clear that the number of people who experience some harm from overdiagnosis often exceeds the number who benefit from screening (31).

2b) Experience of people who are overdiagnosed. The experience of overdiagnosis is similar to the experience of people newly diagnosed with a feared condition: it is often life-changing.
The experience includes unnecessary psychological and physical effects from labeling, diagnostic evaluation, and treatment. The psychological effects of being told one has a potentially life-threatening condition (such as cancer, diabetes, dementia, or an abdominal aortic aneurysm) are obvious. Many people's lives are completely and permanently altered. Although our knowledge of the strength of PPHs is often limited (i.e., they are uncertain PPHs), there is a tendency to treat everyone with a newly diagnosed uncertain PPH as if it were a strong PPH. These treatments are often invasive and carry unintended effects. The unintended effects of evaluation or treatment are easier to tolerate for people who have a moderate or better probability of benefit, but they can be perverse for people with little probability of benefit. They have little probability of benefit because they have a low probability of developing clinically important health problems even if not treated. Detection of a weak PPH (sometimes buried within an uncertain PPH) is clearly overdiagnosis, and treatment constitutes overtreatment, with its attendant harms. In a few situations, clinicians have worked to develop approaches to reduce the intensity of treatment for people detected with weak PPHs, that is, people who have a low probability of developing adverse health outcomes. For example, there is interest in active surveillance rather than surgery for low-grade prostate cancer (32). While these approaches may reduce (although probably not eliminate) the frequency of physical harms from unnecessary treatment, they will likely not reduce (and may even increase) the frequency, severity, and duration of harms from unnecessary labeling and increased surveillance. One of the problems leading to overdiagnosis is our current inability to determine in every case which PPHs are strong and which are weak. Although new research may improve our understanding of factors that strongly predict future adverse health (and thus enable us to strengthen PPHs), it may also be the case that inherent problems with the entire screening strategy limit our ability to accomplish this goal. If the factors that predict poor health outcomes are multifactorial, complex, and multiple (including genetic, environmental, and individual issues), then we may be very far indeed from finding ways to strengthen PPHs and avoid the problems of overdiagnosis. Because of increasingly sensitive screening tests (that detect many abnormalities that would never progress), wider application of those tests to broader populations (with a lower prevalence of strong PPHs), and screening at more frequent intervals, the potential for overdiagnosis has grown over the past 20 years. Because of the increased use of sensitive imaging tests, a new term, "incidentaloma," has been coined to indicate unexpected abnormalities found essentially by screening (33). These abnormalities usually represent weak PPHs that often lead to further diagnostic testing (a type of diagnostic odyssey) and potential harm from overdiagnosis.
PPHs that are preconditions are being added more and more as targets of screening programs, and they often represent the majority of abnormalities detected (e.g., 3.0–3.9-cm abdominal aortic aneurysms constitute about 70% of abnormalities detected in screening for this condition; on screening, many more people are found to have mild cognitive impairment than dementia; many more people are found to have polyps than colorectal cancer; many more people are found to have prediabetes than undiagnosed diabetes). Screening advocates defend detection of precondition PPHs as a major way in which screening programs lead to benefit. It is important, however, to analyze each situation separately. The key issues that separate future benefit from overdiagnosis are the rate of progression of the precondition to an adverse health outcome (i.e., the strength of the precondition as a PPH) and the incremental effectiveness of treatment at an earlier versus later time in reducing the probability of that adverse health outcome. If the precondition progresses infrequently to an adverse health outcome, or if earlier treatment is not significantly more effective than later treatment, then detection of the precondition is more likely to represent overdiagnosis than benefit. Evaluation of screening programs must examine evidence about these issues rather than assume that detection of PPHs that are preconditions is an unmitigated benefit. Another example of growing overdiagnosis is the expansion of screening to older populations or to those with a limited life expectancy because of other conditions (34). Because the benefits of screening usually occur sometime in the future while harms are often nearer term, many older or seriously ill people have a limited time in which to benefit, but they still may suffer the harms of screening. Detection of weak PPHs in these populations is clear overdiagnosis.
For example, modeling for the USPSTF has shown increasing frequency of overdiagnosis for breast and colorectal cancers when people older than age 75–80 years are screened (35, 36), and other models (37) have shown similar results for cervical cancer screening.

3) Magnitude of harms (frequency and severity) inherent in the testing, workup, and treatment of the condition. Overdiagnosis causes greater physical harm if the workup or treatment to which the person is unnecessarily subjected is itself associated with harm. For example, a large percentage of men undergoing radical prostatectomy for prostate cancer suffer lifelong impotence or incontinence (38). Similarly, surgery for carotid artery stenosis sometimes causes the adverse health outcome (i.e., stroke) that it is intended to prevent (22). The harms of treatment, however, extend beyond negative effects on people who are overdiagnosed. The benefit of screening to detect a strong PPH is diminished by a treatment associated with frequent or severe harms. Because treatment is seldom 100% effective, some treated people suffer only or primarily harms rather than benefit. With screening, these harms are suffered for a longer period of time. Theoretically, earlier detection by screening may avoid more noxious treatment. In practice, however, many people with a screening-detected condition are treated as aggressively as people with the same clinically detected condition.

Resource use. In evaluating screening programs, the USPSTF considers provider and patient time and effort but not financial costs. Any health program evaluation is incomplete unless financial costs have been taken into account in some way. Considering financial costs may involve a formal cost-effectiveness analysis or including costs in an outcomes table (see below), but decision makers have a fundamental need to understand what financial resources any net benefit of screening requires.
It is important that total costs be considered (i.e., the societal perspective (39)), including the costs of the harms of the screening program and the costs incurred by people being screened. Considering financial costs may require different skills than assessing and synthesizing benefits and harms, and thus different groups of investigators may be involved in these analyses. To be most useful, these analyses should be able to compare the cost of a given net gain in population health by various screening strategies, designed to reduce the burden of suffering from various health problems (40). Assessing resource use, however, involves more than considering financial costs. For example, a screening program may provide a small net benefit for small financial costs and seem to have a reasonable cost-effectiveness ratio. Yet, implementation may lead to displacement of health care programs with greater net benefit or to greater health disparities, and thus may not be desirable for a population.

Synthesizing benefits and harms: suggested approach to weighing benefits and harms to determine net benefits or net harms

In evaluating proposed or existing screening programs, we emphasize the importance of balancing the magnitude