NOVEL BIOMARKERS: PERSPECTIVES ON DISCOVERY AND DEVELOPMENT
Winton Gibbons Consulting, www.wintongibbons.com

This whitepaper is decidedly focused on themes with more quantitative aspects. However, a general understanding of these matters is important for senior and department management, so they are forewarned about what they might hear, or be pitched, regarding new biomarkers. First will be a discussion of the overall discovery and development process. Then there will be discussions of two aspects of the misunderstood ROC curve.

Biomarker Discovery and Development

Every day there are announcements of new biomarker discoveries. Yet very few biomarkers are validated, and even fewer are ever used clinically. Having a relevant process, and understanding some of the issues upfront to avoid pitfalls, should improve this. It is important to understand the nature of luck and randomness when discovering novel biomarkers.

Setting the Stage

"Chance favors the prepared mind." Louis Pasteur
"The greatest general is he who makes the fewest mistakes." Napoleon
"I would rather have a lucky general than a good one." Napoleon

What lessons are there to help improve this situation? The next figure illustrates a high-level approach. It follows best practices, including upfront hypothesis testing and validation, as well as avoiding red flags and opportunities for bias when designing experiments. At the end of the essay is quantitative information from a simulation using real-world data.

Approach for Novel Biomarker Discovery and Development

- Do hypothesis-driven homework: read the literature, go to conferences, and speak with investigators
- Understand the clinical environment
- Avoid trial-design red flags
- Understand over-fitting
- Parallel track R&D, clinical/regulatory, and commercial
- Fail fast (on poor markers)
- Validate, validate, validate

More specifically, the following example shows some practical, specific milestones. These include suggested types and numbers of samples, as well as a bit about prevalence in upfront discovery. Validation should almost always be done using all-comers who exhibit signs and symptoms of the disease, rather than easy comparisons such as near-normal patients. (A rough sample-size check follows the list below.)

Concurrent assay development and clinical testing

Feasibility: can the assays be made, and is the approach feasible on the intended platform?
- Spiked samples

Clinical utility comparable to (better than) the literature and competition?
- Manual or robotic assays
- Disease vs. normal (better if symptomatic without disease); 1:1 case control, or stratified sample (unlikely at this stage of R&D)
- 100 samples if testing 12 or fewer biomarkers/algorithms; 250 samples if 50 or fewer; 500 if more

First-pass optimization and assessment?
- Alpha or beta assays on instrument
- If the same protocol, then one can allow for more false discoveries in the previous step (avoiding missing true discoveries)

Commercial assay performance better than the current clinical utility of the competition?
- Real first test to set prospective cut-offs, etc.
- Prevalence ratio should be typical for the disease: >10% prevalence means 250 patients; <10% means 500

Sufficient clinical competitive performance for commercialization?
- All-comers
- Greater than 1,000 patients to allow for lower prevalence (one can do a power calculation, but publication and marketing need this many patients or more)
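As a rough illustration of the sizing logic above, here is a minimal sketch; the function names and stage labels are my own, not the whitepaper's. It turns the rule-of-thumb sample sizes into expected diseased-case counts at a given prevalence, which is the quantity those rules are implicitly protecting:

from math import ceil

def suggested_n(stage, prevalence):
    """Rule-of-thumb sample sizes paraphrased from the milestones above."""
    if stage == "cutoff_setting":          # first test at typical prevalence
        return 250 if prevalence > 0.10 else 500
    if stage == "commercial_validation":   # all-comers validation
        return 1000
    raise ValueError(f"unknown stage: {stage}")

def expected_cases(n, prevalence):
    """Expected number of diseased patients in a sample of size n."""
    return n * prevalence

for prev in (0.06, 0.12, 0.50):
    n = suggested_n("cutoff_setting", prev)
    print(f"prevalence {prev:.0%}: n = {n}, ~{ceil(expected_cases(n, prev))} diseased cases")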

As noted in the overall process, trial-design issues should be avoided. The major biases to assess are: selection of proper samples or patients; verification of only those screened in for testing; inclusion of the training set of samples with the test set of samples or patients in the final analysis; and poor blinding. The last point is often hard to implement in a research environment, as scientists view it as an affront to their objectivity. However, there is an abundance of data showing that blinding is necessary, no matter how well intentioned the researchers.

Big Four Biases

Selection (e.g., ascertainment, spectrum)
- Almost always, all-comers in a clinical setting are preferred
- There may be practical limitations
- For discovery, specific other selection methods may have value

Verification (confirmation)
- Not following and assessing the presumed nondiseased

Inclusion (incorporation)
- Including the discovery set with the validation set to estimate performance

Blinding
- Double blind preferred
- Physician should not know lab results before diagnosis
- Lab should not know diagnosis before running the test

If the big four biases are not controlled, the estimated performance will likely be higher than appropriate. The next figure illustrates how various biases or experimental-design choices can shift the estimated performance of a test.

Biased Design Effects on the Relative Estimate of Diagnostic Performance

Higher estimate                    Lower estimate
Severe cases and case controls     Other case control
Referral for test                  As part of other test results
Nonconsecutive sample              Random sample
Retrospective data collection      Prospective data collection
Single or nonblinded               Double blinded
Post hoc cutoff                    Predetermined cutoff
Partial verification
Inclusion

Source: CMAJ, JAMA

If one cannot get sufficient numbers of patients, there are ways to mitigate potential false discoveries. The most used approaches are:

Bonferroni correction (most conservative)
- Divide the desired p value (significance threshold) by the number of biomarkers or algorithms tested. This establishes the new p value that any biomarker or algorithm must beat.

False discovery rate
- Similar to Bonferroni for assessment of the biomarker or algorithm with the best p value. For subsequent ones, the desired p value is divided by the number of biomarkers or algorithms remaining to be assessed (i.e., the correction gets easier as biomarkers or algorithms pass).
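A minimal sketch of the two corrections as described above; this is my own illustrative code, not the author's. Note that the step-down rule described, where the best p value faces the full Bonferroni divisor and later ones face progressively easier thresholds, matches the Holm-Bonferroni procedure (the Benjamini-Hochberg FDR procedure compares the rank-i p value to i/m times alpha instead):

def bonferroni(p_values, alpha=0.05):
    # Every marker must beat alpha divided by the total number tested.
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def step_down(p_values, alpha=0.05):
    # Best p value faces alpha/m, as in Bonferroni; each subsequent one
    # faces alpha / (markers remaining), so the bar eases as markers pass.
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one marker fails, all weaker ones fail too
    return reject

p = [0.001, 0.010, 0.020, 0.300]
print(bonferroni(p))  # [True, True, False, False]
print(step_down(p))   # [True, True, True, False]: the eased bar admits 0.020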

Simulation

To give some specificity to potential ways to improve, a simulation was conducted using actual biomarker data, in order to help frame quantitatively the sample sizes, numbers of markers, and, often less considered, disease prevalence. In the simulations, the number of patients varied from 50 to 1,000, the number of markers from 1 to 100, and the prevalence from 6% to 50%. The key lessons on proportions were:

- Don't use fewer than 250 patients, even when assessing only a few markers
- Start to beware of retrospective individual-marker discovery at 50 potential markers, in the context above
- For multi-marker indices, beware starting at 25 potential markers
- When prevalence is below 12%, use more than 1,000 patients
- Using 500 to 1,000 patients with a prevalence greater than 12% is relatively good, even up to assessing 100 markers

Further noteworthy findings included:

- Degrees of freedom can dramatically affect retrospective biomarker analysis. As either the prevalence or the number of patients decreases, the risk of perceived but random positive results in marker mining increases.
- False AUCs (areas under the curve, or c-statistics) can be quite high. The average experimental AUC for random single markers was 0.62, with the highest a whopping 0.97. The average experimental AUC for random multi-marker indices was 0.65, with the highest a perfect 1.00.

Biomarkers have great value, but only when valid. Having an approach, understanding the patient numbers required, and avoiding biases should increase the chance of success.
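The whitepaper's simulation used actual biomarker data; as a stand-in, the sketch below (my own code, using purely synthetic noise "markers") reproduces the qualitative effect: mining many random markers across few patients at low prevalence yields inflated best-case AUCs even though no marker carries any signal.

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def best_random_auc(n_patients, n_markers, prevalence, n_trials=100):
    # Markers are pure noise, so any apparent performance is false discovery.
    best = []
    for _ in range(n_trials):
        y = rng.random(n_patients) < prevalence          # random disease labels
        if y.sum() in (0, n_patients):                   # AUC needs both classes
            continue
        X = rng.normal(size=(n_patients, n_markers))     # noise "biomarkers"
        aucs = [roc_auc_score(y, X[:, j]) for j in range(n_markers)]
        best.append(max(max(aucs), 1.0 - min(aucs)))     # mining flips signs too
    return float(np.mean(best))

for n, m, prev in [(50, 100, 0.06), (250, 50, 0.12), (1000, 100, 0.50)]:
    print(f"n={n}, markers={m}, prevalence={prev:.0%}: "
          f"mean best false AUC ~ {best_random_auc(n, m, prev):.2f}")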

A Myth about ROC Curves and C-statistics

Area under the curve (AUC, or c-statistic) is not paramount. Shape often matters more.

What is the issue? It boils down to the clinical use of a particular diagnostic. This is not represented by the area under the curve (AUC), or c-statistic; it is determined by the shape of the curve. There are other nuances as well. To begin with, the AUC is only a rough guide to what's good, and it is predominantly useful for comparing curves of the same contour. The bigger the area, for one of a group of curves with the same profile, the better. However, curve shape matters. A lot.

As can be seen in the figure below, a steep shape at the bottom, near the y-axis (high specificity, or low false-positive fraction), is best for rule in: the patient has the disease. Likewise, a shallow, asymptotic shape at the top, with very high sensitivity, is best for ruling out the disease. So curves that are big at the top or bottom are generally more clinically useful than those that are big in the middle. Many (most?) curves are big in the middle.* In contrast, the skewed shapes have a more straightforward clinical value. This can be so even if the AUC is lower for the skewed curve than for a more well-rounded shape.
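A short sketch of this point, on illustrative synthetic data of my own (not the whitepaper's): marker A is strongly elevated in only a subset of diseased patients, giving a steep ROC near the y-axis; marker B has a modest uniform shift, giving a curve that is big in the middle. B wins on AUC, yet A is far more useful for rule-in.

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(1)
n = 4000
y = rng.random(n) < 0.5                      # 50% prevalence for illustration

# Marker A: big elevation in ~40% of diseased only -> steep near y-axis.
a = rng.normal(size=n) + 3.0 * (y & (rng.random(n) < 0.4))
# Marker B: modest shift in all diseased -> well-rounded, "big in the middle".
b = rng.normal(size=n) + 1.0 * y

def sens_at_spec(y_true, score, spec=0.95):
    # Best sensitivity achievable while holding specificity >= 95%.
    fpr, tpr, _ = roc_curve(y_true, score)
    return tpr[fpr <= (1 - spec)].max()

for name, s in [("A (skewed)", a), ("B (rounded)", b)]:
    print(f"marker {name}: AUC = {roc_auc_score(y, s):.2f}, "
          f"sensitivity at 95% specificity = {sens_at_spec(y, s):.2f}")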

What are the implications? For one thing, the idea that the optimal cut-off for an ROC curve sits at the 45-degree-slope inflection is incongruous. The cut-off should be set clinically, based on the treatment algorithm, the risks of false negatives, and the costs and risks of false positives. In fact, for a curve such as the one below (real, disguised data), two cut-offs would be appropriate, with the diagnosis of patients between them indeterminate.

What does this mean practically? Lately, ROC analysis is being used heavily for biomarker discovery, multi-marker indices (FDA-termed IVDMIAs), and the like. Differences in areas under the curves help drive selection of the markers. Often, lists of biomarkers or algorithms are merely ranked by AUCs (c-statistics) as a screen: high good, low bad. Of course, if too many biomarkers are being assayed against too few patient samples, this traditional use of ROCs breaks down even further (so-called false discovery, discussed earlier).

It goes without saying that before this kind of analysis comparing biomarker performance, one should understand the clinical issues and the fit for the diagnostic. With this understanding, the trade-offs between false positives and false negatives, sensitivity versus specificity, or the importance of positive versus negative predictive value can be reasoned through. That shows where on the curve one should be, and which contour can work best. The clinically valuable biomarker, and its ROC, may appear quite middling in the normally used statistics, yet be the most useful.
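A hedged sketch of the two-cutoff idea (my own code and threshold targets, not the author's): pick a rule-out cutoff that holds sensitivity high and a separate rule-in cutoff that holds the false-positive fraction low, leaving patients between them indeterminate.

import numpy as np
from sklearn.metrics import roc_curve

def two_cutoffs(y_true, score, min_sens=0.95, max_fpr=0.05):
    # Rule-out cutoff: highest threshold keeping sensitivity >= min_sens
    # (few missed cases below it). Rule-in cutoff: lowest threshold
    # keeping the false-positive fraction <= max_fpr (few false alarms
    # above it).
    fpr, tpr, thr = roc_curve(y_true, score)
    rule_out = thr[tpr >= min_sens].max()
    rule_in = thr[fpr <= max_fpr].min()
    return rule_out, rule_in

rng = np.random.default_rng(2)
y = rng.random(3000) < 0.3
score = rng.normal(size=3000) + 1.5 * y      # synthetic stand-in marker
lo, hi = two_cutoffs(y, score)
print(f"call negative below {lo:.2f}, positive above {hi:.2f}, "
      f"indeterminate in between")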

So it's time to slog it out. Look at and compare the shapes. For those that appear useful and similar in profile, the AUC is the right tool, and a great tool, for selecting the best biomarker(s). Use it then.

ROC Area Under the Curve (AUC): a Subtlety that is Often Forgotten

For those who know ROCs well, this discussion is likely obvious and well known. However, for many practical users of ROCs, it could be surprising. In addition to AUCs, sensitivities, and specificities, positive and negative predictive values (PPVs and NPVs) are mentioned more and more, as they should be, given their clinical importance. Yet some (often?) times when PPV and NPV are discussed, the formula implicitly assumes equal numbers of diseased and normal patients, as in a 1:1 case-control study. This does not take into account disease prevalence, which most often has dramatic effects.**

This is the classic issue with screening assays: even with decent sensitivity and specificity, the PPV at screening prevalence is untenable. For most diseases requiring screening, the prevalence is even lower, <0.1%, and the situation worse (95% sensitivity and specificity lead to only a 2% PPV). Additionally, at low disease prevalence the NPV is 99+% even for very, very poor assays ("You very probably don't have the disease"). Conversely, at higher disease prevalence, it is hard to rule out disease (NPV). Disease prevalence is critical (and can be manipulated by the inclusion and exclusion criteria of a study).

The implication of all this is that rules of thumb regarding AUCs have no value except in light of disease prevalence, which is very often not mentioned in study results. Even though NPV and PPV are mentioned, and calculated using disease prevalence, more and more, authors and readers gravitate toward the comfortable AUC metric. This is doubly compounded by the choice of cut-offs (and, as we know from the shape discussion, perhaps two cut-offs should most often be chosen: one best for PPV and one for NPV).
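A quick worked check of the arithmetic behind the figures above (my own sketch, applying Bayes' rule; the 95%/95% test at 0.1% prevalence reproduces the ~2% PPV quoted in the text):

def ppv(sens, spec, prev):
    # Bayes' rule: P(disease | positive test).
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

def npv(sens, spec, prev):
    # Bayes' rule: P(no disease | negative test).
    return spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)

for prev in (0.001, 0.01, 0.10, 0.50):
    print(f"prevalence {prev:>6.1%}: PPV = {ppv(0.95, 0.95, prev):5.1%}, "
          f"NPV = {npv(0.95, 0.95, prev):6.1%}")
# prevalence 0.1% gives PPV ~2% despite a 95%/95% test; at 50% prevalence
# PPV is 95%, and it is the NPV side that has become the harder job.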

Companies harp on AUCs, which are often themselves mediocre. So we're back to that simple and misleading metric again. Shape is often more important. When considering ROCs, it is crucial to understand the curve's shape, the epidemiology (e.g., prevalence), and, of course, the intended clinical use of the test.

* There are exceptions for outstanding assays, like troponin for heart attack or CCP for rheumatoid arthritis.

** This inaccurate approach leads to PPV equaling true positives divided by (true positives plus false positives), and NPV equaling true negatives divided by (true negatives plus false negatives), as counted in the study sample, with no reweighting for the true disease prevalence.

(Figure: Preferred nomenclature and calculations)

Winton Gibbons consults to leaders and investors in medical and life-science products on complex and difficult issues. He provides sophisticated quantitative and qualitative analysis, backing strategically sound scenarios and recommendations. This is based on deep real-world experience and an ability to quickly assess markets and technologies.

Strategy | Innovation | Market and Financial Assessments | Corporate and Business Development

www.wintongibbons.com