Evaluating new tests: Which characteristics are important? Graeme Young

Possible conflicts of interest: Eiken Chemical Company (institutional); Clinical Genomics P/L

Recommendations for a step-wise comparative approach to the evaluation of new screening tests for colorectal cancer. Working Party Report. Cancer 2016; 122(6): 826-39.
Young GP, Senore C, Mandel J, Allison JE, Atkin W, Benamouzig R, Bossuyt P, DeSilva M, Guittet L, Halloran S, Haug U, Hoff G, Itzkowitz SH, Levin TR, Leja M, Levin B, McFarland EG, Meijer GA, O'Morain CA, Parry S, Rabeneck L, Rozen P, Saito H, Schoen RE, Seaman HE, Steele RJC, Sung JJY, Winawer SJ.
Host Societies: UEGF, WGO, OMED, BSG

Introduction
Overview: To develop practical advice on how best to compare new with proven screening tests, the ideal context, the informative endpoints and the appropriate study design.
Focus for today: to comment on the endpoints that matter.

Guiding principles
1. Screening aims to reduce the burden of disease in the population.
2. The screening test is just one event in a process.
3. Population randomized controlled trials (RCTs) with mortality as the primary outcome set the standard for the evaluation of new tests. They give:
   a) clear guidance on intention-to-screen endpoints;
   b) the surrogate endpoints that facilitate prediction of benefit.
4. New tests can be assessed in parallel with an existing test.
   a) When an RCT has established that a test reduces mortality, a new test does not need to be evaluated in the same way, provided that it is compared with the proven test.
5. New screening tests might detect a different biology.
6. In 2-step screening, a positive test increases the likelihood of neoplasia being present.
7. It is not ethically justifiable to study a test in the screening environment, including its acceptability to invitees or other screening program outcomes, without prior studies indicating that the new test is of acceptable accuracy compared with a proven comparator test.
8. New tests must be (technically) defined.

Is a mortality-endpoint RCT needed?
Where a screening test has been proved to be effective, a direct comparison of the new test with the proven test will inform the user of the new test's benefit, because the proven test's impact is understood and the surrogate measures of that impact are defined.
Endpoints for comparison include:
- Performance measures (sensitivity, specificity)
- Program measures (participation, cost, etc.)
Context for comparison must ultimately be unbiased screening populations.
Acceptable comparators:
- gFOBT are the minimum standard.
- FIT are also acceptable (clearly superior to gFOBT).
- Colonoscopy serves as the best means of diagnostic verification but does not allow accurate insight into a test's effectiveness.

Accuracy Evaluation
Determining true sensitivity and specificity for screen-relevant lesions is challenging. Absolute and relative estimates are required.
For absolute estimates: What is good enough? All participants need colonoscopy*.
For relative estimates, there are two practical approaches:
1. Paired design (improves power) incorporating a proven comparator.
   Acceptable comparators: gFOBT are the minimum standard. FIT are better (clearly superior to gFOBT). Colonoscopy is used for diagnostic verification, not for insight into effectiveness.
2. Diagnostic verification of every positive case.
*Unless one includes 2-4 years of follow-up.

What is good enough?
This depends on one's perspective. There are two key questions concerning clinical accuracy:
1. Detection: a test that is more sensitive returns, in practical terms, more true-positives.
2. Burden associated with detection: a test that is more specific returns, in practical terms, fewer false-positives.
So we learn a lot just by determining whether a positive result is true or false for both the new test and the comparator.

Relative performance
Detection and the burden of detection are readily estimated by thorough diagnostic verification of every test-positive case (both comparator and new-test positives) to determine whether it is a true-positive or a false-positive. The simple dichotomous measures of the true-positive rate (TPR) and the false-positive rate (FPR) are direct and practical measures of accuracy, sometimes referred to as test operating characteristics. They are used when undertaking receiver operating characteristic (ROC) analysis. The TPR reflects detection (sensitivity), and the FPR reflects the burden associated with detection (1 - specificity). Consequently, relative sensitivity and specificity are determined by comparing the TPR and FPR of the new and old tests, all without resorting to colonoscopy for everyone.
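The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the Working Party's method: the class and function names are hypothetical, and it assumes a paired design in which both tests are applied to the same screened people and TPR and FPR are expressed per person screened.

```python
from dataclasses import dataclass


@dataclass
class VerifiedPositives:
    """Counts from diagnostic verification of test-positive cases only (hypothetical structure)."""
    true_positives: int   # positives with a screen-relevant lesion at colonoscopy
    false_positives: int  # positives with no screen-relevant lesion at colonoscopy


def operating_characteristics(v: VerifiedPositives, n_screened: int):
    """Return (TPR, FPR, PPV), with TPR and FPR expressed per person screened (assumption)."""
    if n_screened <= 0:
        raise ValueError("n_screened must be positive")
    tpr = v.true_positives / n_screened
    fpr = v.false_positives / n_screened
    ppv = tpr / (tpr + fpr)  # equals TP / (TP + FP)
    return tpr, fpr, ppv


def relative_performance(new: VerifiedPositives, old: VerifiedPositives, n_screened: int):
    """Compare a new test with a proven comparator applied to the same people."""
    tpr_new, fpr_new, _ = operating_characteristics(new, n_screened)
    tpr_old, fpr_old, _ = operating_characteristics(old, n_screened)
    return {
        "relative_tpr": tpr_new / tpr_old,  # relative detection (relative sensitivity)
        "relative_fpr": fpr_new / fpr_old,  # relative burden of detection
    }


# Illustrative numbers only, not taken from any study.
new_test = VerifiedPositives(true_positives=60, false_positives=540)
comparator = VerifiedPositives(true_positives=50, false_positives=450)
result = relative_performance(new_test, comparator, n_screened=10_000)
```

In this made-up example the new test finds more lesions but generates proportionally more false-positives, so the gain in detection comes with a matching increase in colonoscopy burden.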

Operating characteristics and accuracy
Test result | Diagnostic verification; operating characteristic | Related accuracy characteristic | Issue addressed
Positive | True; true-positive rate (TPR) | Sensitivity | Detection
Positive | True; true-positive rate (TPR) | Positive predictive value, PPV (TPR/(TPR+FPR)) | Efficiency of detection
Positive | False; false-positive rate (FPR) | Specificity (1 - FPR) | Burden associated with detection
Negative | True; true-negative rate | Specificity | Elimination/exclusion of disease
Negative | False; false-negative rate | Missed lesion | Burden of failed detection
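The reason ratios of TPR and FPR stand in for relative sensitivity and specificity can be written out explicitly. The derivation below is a sketch under assumptions not stated on the slide: a paired design with N people screened, of whom D have the target lesion, and TPR and FPR both expressed per person screened (TP and FP are the verified true- and false-positive counts).

```latex
\begin{align*}
\mathrm{TPR} &= \frac{TP}{N}, \qquad
\mathrm{FPR} = \frac{FP}{N}, \qquad
\mathrm{PPV} = \frac{TP}{TP + FP} = \frac{\mathrm{TPR}}{\mathrm{TPR} + \mathrm{FPR}} \\[4pt]
\frac{\mathrm{TPR}_{\text{new}}}{\mathrm{TPR}_{\text{old}}}
  &= \frac{TP_{\text{new}}/D}{TP_{\text{old}}/D}
   = \frac{\mathrm{Se}_{\text{new}}}{\mathrm{Se}_{\text{old}}} \\[4pt]
\frac{\mathrm{FPR}_{\text{new}}}{\mathrm{FPR}_{\text{old}}}
  &= \frac{FP_{\text{new}}/(N - D)}{FP_{\text{old}}/(N - D)}
   = \frac{1 - \mathrm{Sp}_{\text{new}}}{1 - \mathrm{Sp}_{\text{old}}}
\end{align*}
```

Because D and N - D are the same for both tests in a paired design, they cancel, so the relative measures need only the verified positives and not colonoscopy of test-negative participants.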


FIT accuracy for CRC: 75% 94%
Young GP, et al. Advances in Fecal Occult Blood Tests: The FIT Revolution. Dig Dis Sci. 2015; 60: 609-622.

Desirable values
Test result | Related accuracy characteristic | Desired attribute, absolute and relative
Positive | Sensitivity | > 75% (cancer). > FIT head-to-head.
Positive | PPV (TPR/(TPR+FPR)) | > 3% (cancer) or > 15% for advanced neoplasia. Relative to FIT head-to-head.
Positive | Specificity (1 - FPR) | In range 85-98%, ideally 95%. Subject to formal cost-effectiveness studies. Better to use test positivity rate in target population. Relative to FIT head-to-head when set at equivalent sensitivity.
Negative | Specificity |
Negative | Missed lesion |
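As a rough illustration, the absolute thresholds in the table can be checked mechanically. The function below is a hypothetical sketch: its name and the decision to treat the 85-98% specificity range as inclusive are assumptions, and it covers only the absolute values for cancer, not the head-to-head comparison with FIT.

```python
def meets_desirable_values(sensitivity_crc: float, ppv_crc: float, specificity: float) -> dict:
    """Check proportions (0-1) against the absolute desirable values for cancer.

    Thresholds come from the table above; treating the specificity range
    as inclusive is an assumption made for this sketch.
    """
    for name, value in (("sensitivity_crc", sensitivity_crc),
                        ("ppv_crc", ppv_crc),
                        ("specificity", specificity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a proportion between 0 and 1")
    return {
        "sensitivity_above_75pct": sensitivity_crc > 0.75,
        "ppv_above_3pct": ppv_crc > 0.03,
        "specificity_within_85_to_98pct": 0.85 <= specificity <= 0.98,
    }


# Illustrative values only, not results from any study.
checks = meets_desirable_values(sensitivity_crc=0.80, ppv_crc=0.05, specificity=0.95)
```

A test that passes these absolute checks would still need the relative, head-to-head comparison with FIT described in the table.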

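Test evaluation framework
Flow diagram: a population undergoes either (1) single, paired testing or (2) multiple, randomised testing on an intention-to-screen (ITS) basis, using the existing screening test and the new screening test. Each test returns a negative or positive result, and positives proceed to colonoscopy.
1, for testing accuracy. 2, for testing population outcomes.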

Pathway evaluation
Larger-scale evaluation in the screening context is ultimately required to justify large-scale uptake of a new test.
Other crucial variables in the screening pathway: safety, cost, feasibility, ease of use for a screenee to perform, and participation and re-participation.
Outcomes must be evaluated on an intention-to-screen basis.
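The practical effect of an intention-to-screen analysis is that outcomes are counted per person invited rather than per person who took part. The sketch below illustrates this with hypothetical function names and made-up numbers; it is not a prescribed analysis.

```python
def intention_to_screen_summary(invited: int, participated: int, lesions_detected: int) -> dict:
    """Contrast yield per participant with yield per invitee (intention-to-screen)."""
    if invited <= 0 or not 0 < participated <= invited:
        raise ValueError("require 0 < participated <= invited")
    if not 0 <= lesions_detected <= participated:
        raise ValueError("lesions_detected must lie between 0 and participated")
    return {
        "participation": participated / invited,
        # Per-participant yield ignores people who declined the invitation.
        "yield_per_1000_participants": 1000 * lesions_detected / participated,
        # Intention-to-screen yield uses everyone invited as the denominator.
        "yield_per_1000_invited": 1000 * lesions_detected / invited,
    }


# Illustrative numbers only.
summary = intention_to_screen_summary(invited=10_000, participated=4_000, lesions_detected=20)
```

With these made-up figures, a test can look better per participant than it does per invitee, which is why participation and re-participation matter when comparing tests at the program level.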

Conclusions
The Recommendations provide a framework for evaluating a new screening test for its effectiveness in population screening, by comparing it to a proven test.
Evaluating screening tests is not a simple matter of comparing how well they detect neoplasia, but it starts with estimates of accuracy.
Exactly which parameters are considered acceptable at each phase of evaluation depends upon screening program philosophy.
Estimates of accuracy might be absolute or relative. Each approach has its place. Regulatory authorities like absolute estimates made in typical target populations.
Sensitivity should at least match that of FIT, without disregarding specificity (workload, cost, anxiety and so on).
Report specificity at equivalent sensitivity.