Advanced Power: After Power What Then? By Dan Ayers Senior Associate in Biostatistics

Similar documents
About Reading Scientific Studies

Response to the ASA s statement on p-values: context, process, and purpose

How to craft the Approach section of an R grant application

Raya Abu-Tawileh. Dana Alrafaiah. Hamza Alduraidi. 1 P a g e

Introduction; Study design

Multiple Regression Models

Chapter 25. Paired Samples and Blocks. Copyright 2010 Pearson Education, Inc.

AP STATISTICS 2013 SCORING GUIDELINES

Hypothesis-Driven Research

Action science skills for managing down, across or up the ladder

Where does "analysis" enter the experimental process?

Probability and Statistics Chapter 1 Notes

Experimental Design for Immunologists

CHAMP: CHecklist for the Appraisal of Moderators and Predictors

Registered Reports: Peer reviewed study pre-registration. Implications for reporting of human clinical trials

Audio: In this lecture we are going to address psychology as a science. Slide #2

Pacira v. FDA: Summary of Declaration by Lee Jen Wei, PhD Concluding that the Pivotal Hemorrhoidectomy Study for EXPAREL Demonstrated a Treatment

Critical Appraisal Istanbul 2011

MODULE 3 APPRAISING EVIDENCE. Evidence-Informed Policy Making Training

Hypothesis generation and experimental design. Nick Hunt Bosch Institute School of Medical Sciences

1. What is the difference between positive and negative correlations?

Quality Digest Daily, March 3, 2014 Manuscript 266. Statistics and SPC. Two things sharing a common name can still be different. Donald J.

Worksheet for Structured Review of Physical Exam or Diagnostic Test Study

Our goal in this section is to explain a few more concepts about experiments. Don t be concerned with the details.

Unit 1 History and Methods Chapter 1 Thinking Critically with Psychological Science

TESTING FEAR OF SOCIAL ISOLATION

Timothy Wilken, MD. Local Physician, Synergic Scientist, and Perennial Student.

More on the Concepts of Risk, Odds, the Relative Risk Ratio, and the Odds Ratio STAT 110 Fall 2017

Evaluating you relationships

Learning objectives. Examining the reliability of published research findings

Patrick Breheny. January 28

17/10/2012. Could a persistent cough be whooping cough? Epidemiology and Statistics Module Lecture 3. Sandra Eldridge

Bayesian and Frequentist Approaches

The Reproducibility Debate in Science

Making Ethical Decisions

lab exam lab exam Experimental Design Experimental Design when: Nov 27 - Dec 1 format: length = 1 hour each lab section divided in two

Optimization and Experimentation. The rest of the story

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data

Homework Answers. 1.3 Data Collection and Experimental Design

UNIT. Experiments and the Common Cold. Biology. Unit Description. Unit Requirements

Study Design. Study design. Patrick Breheny. January 23. Patrick Breheny Introduction to Biostatistics (171:161) 1/34

Confidence Intervals On Subsets May Be Misleading

(true) Disease Condition Test + Total + a. a + b True Positive False Positive c. c + d False Negative True Negative Total a + c b + d a + b + c + d

CHAPTER 5: PRODUCING DATA

Sampling Controlled experiments Summary. Study design. Patrick Breheny. January 22. Patrick Breheny Introduction to Biostatistics (BIOS 4120) 1/34

UNLOCKING VALUE WITH DATA SCIENCE BAYES APPROACH: MAKING DATA WORK HARDER

Analysis of Variance (ANOVA)

August 26, Dept. Of Statistics, University of Illinois at Urbana-Champaign. Introductory lecture: A brief survey of what to

How is ethics like logistic regression? Ethics decisions, like statistical inferences, are informative only if they re not too easy or too hard 1

WS0210 Practical 1.Introducing fungi and experimental design

False Positives & False Negatives in Cancer Epidemiology

Question: What steps do scientists follow in conducting scientific research?

Science in Natural Resource Management ESRM 304

Sample Size, Power and Sampling Methods

Eliminative materialism

SEMINAR ON SERVICE MARKETING

Common Statistical Issues in Biomedical Research

Christopher Chambers Cardiff University. Rob McIntosh University of Edinburgh. Zoltan Dienes University of Sussex

e.com/watch?v=hz1f yhvojr4 e.com/watch?v=kmy xd6qeass

A Case Study: Two-sample categorical data

USE AND MISUSE OF MIXED MODEL ANALYSIS VARIANCE IN ECOLOGICAL STUDIES1

Safeguarding public health Subgroup analyses scene setting from the EU regulators perspective

Design of Experiments & Introduction to Research

Variables Research involves trying to determine the relationship between two or more variables.

Lecture 2. Key Concepts in Clinical Research

Academic Ethics. Sanjay Wategaonkar Department of Chemical Sciences. 8 th August 2016

RESEARCH METHODS. Winfred, research methods, ; rv ; rv

EMA Workshop on Multiplicity Issues in Clinical Trials 16 November 2012, EMA, London, UK

Why is science knowledge important?

NOVEL BIOMARKERS PERSPECTIVES ON DISCOVERY AND DEVELOPMENT. Winton Gibbons Consulting

Research Methodology. Characteristics of Observations. Variables 10/18/2016. Week Most important know what is being observed.

Experimental Design There is no recovery from poorly collected data!

RESEARCH METHODS. Winfred, research methods,

AND YOUR LIFE FINDING A SOLUTION FOR HEARING LOSS

The Logic of Data Analysis Using Statistical Techniques M. E. Swisher, 2016

CLINICAL PROTOCOL DEVELOPMENT

UNIT II: RESEARCH METHODS

The Scientific Enterprise. The Scientific Enterprise,

If I m interested, why is it a conflict? Ross McKinney, Jr, MD

CHAPTER OBJECTIVES - STUDENTS SHOULD BE ABLE TO:

Inferential Statistics

Sleeping Beauty is told the following:

TEACHING BAYESIAN METHODS FOR EXPERIMENTAL DATA ANALYSIS

Math 1680 Class Notes. Chapters: 1, 2, 3, 4, 5, 6

plural noun 1. a system of moral principles: the ethics of a culture. 2. the rules of conduct recognized in respect to a particular group, culture,

The Cognitive Model Adapted from Cognitive Therapy by Judith S. Beck

2 Critical thinking guidelines

Paper presentation: Preliminary Guidelines for Empirical Research in Software Engineering (Kitchenham et al. 2002)

The Process of Scientific Inquiry Curiosity. Method. The Process of Scientific Inquiry. Is there only one scientific method?

5. MEDICARE COVERAGE OF SCREENING FOR OPEN-ANGLE GLAUCOMA. Costs to Medicare

Evidence Informed Practice Online Learning Module Glossary

Gathering. Useful Data. Chapter 3. Copyright 2004 Brooks/Cole, a division of Thomson Learning, Inc.

Retrospective power analysis using external information 1. Andrew Gelman and John Carlin May 2011

Collection of Biostatistics Research Archive

Chapter 23. Inference About Means. Copyright 2010 Pearson Education, Inc.

Introduction & Basics

Observational study is a poor way to gauge the effect of an intervention. When looking for cause effect relationships you MUST have an experiment.

CHAPTER 9.1. Summary

Safeguarding public health Subgroup Analyses: Important, Infuriating and Intractable

Transcription:

Advanced Power: After Power What Then? By Dan Ayers Senior Associate in Biostatistics

What do you actually get? One measure for this is the probability that a random subject selected from a treated group will have a larger (smaller) value than a random subject from the untreated (or control) population. Simple case Control Tumor Volume (mm 3 ) ~ N(1200, 587) Treatment 1 Tumor Volume (mm 3 ) ~ N(826, 587) Treatment 2 Tumor Volume (mm 3 ) ~ N(533, 587)

What do you actually get? Control group TV ~ N(1200, 300) No Effect Treatment TV ~ N(1200, 300) Pr(Y<X) = 0.5 Pr(X<Y) = 0.5 Odds (Y<X) = 1 0 800 1600 2400 Tumor Volume (mm 3 )

What do you actually get? Control group TV ~ N(1200, 300) Treatment 1 TV (E=1/2 S.D) ~ N(1050, 300) Pr(Y<X) = 0.64 Pr(X<Y) = 0.36 Odds (Y<X) = 1.78 0 800 1600 2400 Tumor Volume (mm 3 ) Sample size = 63 per group

What do you actually get? Control group TV ~ N(1200, 300) Treatment 1 TV (E=1 S.D) ~ N(900, 300) Pr(Y<X) = 0.76 Pr(X<Y) = 0.24 Odds (Y<X) = 3.3 0 800 1600 2400 Sample size = 17 per group Tumor Volume (mm 3 )

Decision Error in Experiments Truth Study Says No Effect Effect No Effect correct False positive Effect False negative correct

Decision Error in Experiments Of course, we don t know the truth, that s why we are doing the experiment. Can still control the 2 types of error using statistical methods (i.e. power calculations) False positive = Type I error ( ) False negative = Type II error ( ) Overall, we won t be wrong very often. Right?

What do you actually get? Is the type I and type II error rate approach a good decision approach theoretically? Simple case 5% and 30% of 1000 investigators make true hypotheses and each test at type I and type II errors of 5% and 20% How many correct conclusions (true positives and true negatives) will we make? How many true differences are detected?

R=0.0526 = 0.05 = 0.20 5% True Effects 1000 Hypotheses 50 True Relationships (1- ) 40 Statistically Significant Results 950 False Relationships 48 Statistically Significant Results An Assumption! Thus out of 1000 hypotheses, 88 will have statistically significant evidence of a true effect in their favor. Of these, 40 (45%) are in fact true.

R=0.429 = 0.05 = 0.20 30% True Effects 1000 Hypotheses 300 True Relationships (1- ) 240 Statistically Significant Results 700 False Relationships 35 Statistically Significant Results An Assumption! Thus out of 1000 hypotheses, 275 will have statistically significant evidence of a true effect in their favor. Of these, 240 (87%) are in fact true.

Type I error = 5%, Power=80% Percent of 1000 Tests 0.2 0.4 0.6 0.8 1.0 True Differences Discovered Correct Conclusions Made 0.0 0.2 0.4 0.6 0.8 1.0 Percent of Hypothesized Differences That Are True

Problem Investigators consistently face the requirement to replicate research, including studies that were fully powered to have a high probability of detecting meaningful effects Basic science Publication requirements of reviewers Clinical science FDA IND approval

Why Repeat Testing Basic science experiments highly sensitive to known and unknown sources of variability Most scientists have experienced (perhaps many times) the situation where results have been completely reversed upon repeated testing Relatively low chance a result will be confirmed by others Repeatability illustrates PI expertise and robustness of experimental results. Referee s (journals) require replication of even fully powered studies Replicate once or twice to get confirmation of the statistical outcome or Replicate only statistically significant results.

To Repeat or Not to Repeat, That is the Question! Should the IACUC allow investigators to repeat fully powered experiments 2 or 3 times? Is it appropriate for journals to require replicated experiments?

Journal of Biological Chemistry "We do not have any written policy on the question raised. The requirement is fundamentally that all experiments are hopefully reproducible, and have been replicated sufficiently to be statistically significant and to justify presentation of the data and conclusions. The specific requirements will vary with each manuscript and with the opinion of the reviewers in any specific case."

Decision Trees for Testing Behavior Repeat as planned regardless of outcome Single repeat 2 repeats Repeat only if first results statistically significant Single repeat 2 repeats

3 Experiments No Treatment Effect (only error is a false positive) Test 1 0.05 0.95 Test 2a Test 2b 0.05 2 0.05*0.95 0.95*0.05 0.95 2 Test 3a Test 3b Test 3c Test 3d 0.05 3 0.05 2 *0.95 0.05 2 *0.95 0.05*0.95 2 0.05 2 *0.95 0.05*0.95 2 0.05*0.95 2 0.95 3 Probability of 2 Correct Conclusions in 2 Experiments =0.90 Probability of 2 Correct Conclusions in 3 Experiments =0.95 Probability of 3 Correct Conclusions in 3 Experiments =0.86 False Positive True Negative

Three Experiments True Treatment Effect (only error is a false negative) Test 1 0.80 0.20 Test 2a Test 2b 0.80 2 0.80*0.20 0.80*0.20 0.20 2 Test 3a Test 3b Test 3c Test 3d 0.80 3 0.80*0.20 2 0.80 2 *0.20 0.80*0.20 2 0.80 2 *0.20 0.80*0.20 2 0.80*0.2 2 0.2 3 Probability of 2 Correct Conclusions in 2 Experiments =0.64 Probability of 2 Correct Conclusions in 3 Experiments =0.77 Probability of 3 Correct Conclusions in 3 Experiments =0.51 False Positive True Negative

3 Experiments if First is Statistically Significant No Treatment Effect (only error is a false positive) Test 1 0.05 0.95 Test 2a Test 2b 0.05 2 0.05*0.95 0.95*0.05 0.95 2 Test 3a Test 3b Test 3c Test 3d 0.05 3 0.05 2 *0.95 0.05 2 *0.95 0.05*0.95 2 0.05 2 *0.95 0.05*0.95 2 0.05*0.95 2 0.95 3 Probability of 2 Correct Conclusions in 2 Experiments =0 Probability of 2 Correct Conclusions in 3 Experiments =0.045 Probability of 3 Correct Conclusions in 3 Experiments =0 False Positive True Negative

3 Experiments if First is Statistically Significant True Treatment Effect (only error is a false negative) Test 1 0.80 0.20 Test 2a Test 2b 0.80 2 0.80*0.20 0.80*0.20 0.20 2 Test 3a Test 3b Test 3c Test 3d 0.80 3 0.80*0.20 2 0.80 2 *0.20 0.80*0.20 2 0.80 2 *0.20 0.80*0.20 2 0.80*0.2 2 0.2 3 Probability of 2 Correct Conclusions in 2 Experiments =0.64 Probability of 2 Correct Conclusions in 3 Experiments =0.77 Probability of 3 Correct Conclusions in 3 Experiments =0.51 False Positive True Negative

R=0.0526 = 0.955 = 0.36 Confirming Only If First Result Statistically Significant 1000 Hypotheses 50 True Relationships (1- ) 32 Statistically Significant Results 950 False Relationships 908 Statistically Significant Results An Assumption! Thus out of 1000 hypotheses, 940 will have statistically significant evidence of a true effect in their favor. Of these, 32 (3.4%) are in fact true.

Solution: Better Study Design Blind replication of studies is an ill conceived ad hoc practice that significantly hinders scientific discovery. Hypothesis testing and reproducibility are goals that can easily be met by experimental design In this environment, the two-sample t-test should be retired.

Reproducibility By Design Focus is not to repeat entire experiments, but to account for and estimate nuisance sources of variability by design Simple Example: Combining data from 3 sources (e.g., cell lines revived at 3 time periods over a year)

Normalized qrtpcr Data Gene1 Gene 2 WT Mutant WT Mutant 0.0001069 0.00003296 0.0002474 0.0003908 0.0007165 0.00002966 0.0002090 0.0003623 0.0007848 0.00003769 0.0003767 0.0002796 0.0001624 0.0003840 0.0001342 0.0003971 0.00013842 0.00003794 0.0009806 0.0006786 0.00027723 0.00003863 0.0009798 0.0008705 0.00025064 0.00003857 0.0008997 0.0020228 0.00013400 0.0007913 0.00018913 0.0007527 0.00017758 0.00004626 0.0016572 0.0043454 0.00012128 0.00002270 0.0014788 0.0030829 0.00012925 0.0011352 0.0019315 0.00012800 0.0012355

Analysis Gene Percent of Random Error Explained by Time Ignoring Time Wildtype vs Mutant Accounting for Time Time Wildtype vs Mutant 1 31% p<0.0001 p=0.04 p<0.0001 2 63% p=0.09 p<0.0001 P=0.022

Summary Statisticians and cancer biologists should intimately understand what a power calculation is providing. Overall hypothesis testing framework seems to work well Neither type I or type II error have the impact that high rates of correct hypotheses have on having a high success yield from the hypothesis testing framework The practice of confirming only a statistically significant initial result is disastrous. statisticians and biologists should realize the simple two-sample t-test represents a design that is probably inappropriate. Most experiments are more complex and require a statistical design and analysis that suits the complexity.