EPSE 594: Meta-Analysis: Quantitative Research Synthesis


EPSE 594: Meta-Analysis: Quantitative Research Synthesis
Ed Kroc
University of British Columbia
ed.kroc@ubc.ca
March 14, 2019

Last Time
- Synthesizing multiple study outcomes: independent vs. dependent outcomes
- The vote counting problem
- Power analysis in meta-analyses
- Case study: Kroesbergen & Van Luit (2003)

Today
- Recap
- Meta-analyzing multiple outcomes when you do not want to combine them
- Publication bias
- Case study: Taylor et al. (2019)

The vote counting problem
There is a common and pernicious methodological error that one often encounters: so-called vote counting. Vote counting occurs when a researcher simply looks at all the p-values corresponding to a test of null effect across many studies, counts how many are significant and how many are non-significant, and then declares a winner. Besides the obvious problems with potential publication bias, such a procedure is statistically incoherent (even if no publication bias were present). In fact, as the number of studies increases, the statistical power of a vote counting procedure tends to approach zero!

Power analysis in meta-analysis
Intuitively, it seems reasonable to expect that a meta-analysis of many studies should have higher power than any of the individual studies. This heuristic is approximately true (under certain assumptions). If all studies really measure the same outcome over comparable samples, with instruments that generate measurements of comparable reliability and validity, then the only major factor influencing power that changes between studies is sample size. But the combined effect in a meta-analysis is always based on a larger sample than any one study; hence, greater power.

Power analysis in meta-analysis
There are several problems with this naive reasoning in practice:
- Samples (sampling frames) are rarely directly comparable across studies. The effects of disparate samples can wreak havoc on power.
- Instruments of measurement are rarely identical across studies. Even if they are, differences in sampling frame can affect the reliability and validity of any common measurements.
- Study design usually varies across studies. This will affect power in a nontrivial way.
- For a random-effects meta-analysis, between-study variation in true effect size can easily destroy overall power.
All of these issues are nontrivial to account for.

Power analysis in meta-analysis
The best way to perform power analysis for a meta-analysis is the same as always:
(1) Make informed guesses about the factors affecting power, e.g. between-study variance, reliability of measurements, etc.
(2) Estimate power.
(3) Conduct an extensive sensitivity analysis.
See Borenstein, Chapter 29, for more tips on power analysis for meta-analysis.
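As a rough illustration of these steps, here is a minimal R sketch of power for the summary effect of a meta-analysis of standardized mean differences, following the general logic of Borenstein, Chapter 29. All inputs (number of studies, per-arm sample size, true effect, between-study variance) are illustrative assumptions, and step (3) is the sensitivity loop over the between-study variance.

```r
# Minimal sketch (assumptions: k studies, each a two-group comparison with n per
# arm, true standardized mean difference delta, between-study variance tau2).
# Illustrative only; follows the general logic of Borenstein, Ch. 29.
meta_power <- function(k, n, delta, tau2 = 0, alpha = 0.05) {
  v_i    <- 2 / n + delta^2 / (4 * n)   # approx. within-study variance of d
  V_star <- (v_i + tau2) / k            # variance of the random-effects summary
  lambda <- delta / sqrt(V_star)        # noncentrality of the summary z-test
  z_crit <- qnorm(1 - alpha / 2)
  pnorm(lambda - z_crit) + pnorm(-lambda - z_crit)
}

# Step (3): sensitivity analysis over a plausible range of between-study variances.
sapply(c(0, 0.05, 0.1, 0.2),
       function(t2) meta_power(k = 10, n = 30, delta = 0.3, tau2 = t2))
```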

Power analysis in meta-analysis
Still, in many cases meta-analyses do tend to have greater power than single studies. One important application of this is in the study of adverse effects of treatment:
- When assessing the efficacy of a treatment, primary studies are usually designed (and powered) to study this potential effect only.
- In particular, primary studies are usually not designed to be able to detect negative side effects of treatment.
- Of course, these are of great clinical interest, so a common way to study the typical severity of adverse effects is via a meta-analysis.
- The (hopefully) higher power of the meta-analysis will allow for more reasonable estimation of potential adverse effects.

Synthesizing multiple study outcomes
Sometimes we want to combine multiple outcomes into a single effect; e.g. for Zheng et al. (2016), what if we wanted to combine the English, reading, and writing outcomes, as well as the math and science outcomes? Then the effect sizes are not independent. If we ignore this dependency, we will weaken the reliability of our meta-analyses.
Why would we want to do this?
- It could make comparisons across studies more alike. For the Zheng et al. example, some studies may have investigated effects only for math classes, whereas other studies only for science classes. It may be natural (and more interpretable) to combine such effects.
- It can drastically increase our statistical power by combining outcomes.

Synthesizing multiple study outcomes
There is (at least) one major problem with this process: it requires that we estimate the correlation ρ between two constituent outcomes. But this is rarely reported or known! Instead, people will try to guess a reasonable value for the correlation from the previous literature, and then perform a sensitivity analysis (see the sketch below):
- Perform the meta-analysis with the proposed correlation.
- Then change this estimate and rerun the meta-analysis to see if any substantive conclusions change.
Typically, this means we should vary the correlation within a plausible range, then perform meta-analyses for many of the values in this range to see if/how conclusions change.
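A minimal sketch of such a sensitivity analysis, with made-up data: two correlated outcomes per study are combined into a composite (their mean), using the standard variance formula for the mean of two correlated effects (as in Borenstein et al., Chapter 24), and the meta-analysis is rerun over a range of assumed correlations. The effect sizes, variances, and correlation values below are all hypothetical.

```r
# Minimal sketch: composite of two correlated outcomes per study, with a
# sensitivity analysis over the assumed within-study correlation r.
library(metafor)

y1 <- c(0.30, 0.10, 0.45); v1 <- c(0.04, 0.03, 0.05)   # outcome 1 (hypothetical)
y2 <- c(0.25, 0.20, 0.35); v2 <- c(0.05, 0.04, 0.06)   # outcome 2 (hypothetical)

composite_meta <- function(r) {
  y_comp <- (y1 + y2) / 2
  v_comp <- 0.25 * (v1 + v2 + 2 * r * sqrt(v1 * v2))   # variance of the mean of two correlated effects
  rma(yi = y_comp, vi = v_comp, method = "REML")        # random-effects meta-analysis
}

# Rerun across a plausible range of correlations; check whether conclusions change.
for (r in c(0.3, 0.5, 0.7, 0.9)) {
  res <- composite_meta(r)
  cat(sprintf("r = %.1f: estimate = %.3f [%.3f, %.3f]\n",
              r, res$beta, res$ci.lb, res$ci.ub))
}
```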

Synthesizing multiple study outcomes
Alternatively, we could try ignoring the dependency between constituent outcomes and treating them as independent. But this will underestimate the variance (overestimate the precision) in our meta-analysis! This is a manifestation of the same problem as treating repeated measurements as independent. There are special cases where this problem is minimized (or even absent); see Chapter 25 of Borenstein.
My advice: avoid constructing composite effect sizes. Meta-analyses require so many hard-to-validate assumptions already; do not require any more! If you want to create composite effects because your meta-analysis is so low powered, then you probably shouldn't be conducting a meta-analysis anyway.

Analyzing multiple study outcomes separately
If we do not want to combine multiple study outcomes, then the best option is usually to perform separate meta-analyses for each outcome of interest. Practically, the small sample size (number of studies) will make other alternatives infeasible. This is what Zheng et al. (2016) did with their five separate meta-analyses, one for each of five different school subjects: English, reading, writing, math, and science.

Analyzing multiple study outcomes separately
Major problem 1: Potential double-dipping (e.g. using the same data about a moderator in each of the separate meta-analyses).
Solution: adjust for the inflated Type I error rate (a sketch follows below):
- Bonferroni (most common, most conservative, least powerful)
- Holm-Sidák (fairly common, more powerful)
- False discovery rate adjustment (least common in the social sciences, most powerful, but changes the interpretation)
- And others...
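A minimal sketch of such adjustments, using hypothetical p-values from five separate meta-analyses. Base R's p.adjust() covers Bonferroni, Holm (Holm-Bonferroni, not Holm-Sidák itself), and the Benjamini-Hochberg false discovery rate.

```r
# Hypothetical p-values from five separate meta-analyses (one per outcome).
p_raw <- c(english = 0.012, reading = 0.034, writing = 0.210,
           math = 0.004, science = 0.048)

p.adjust(p_raw, method = "bonferroni")  # most conservative
p.adjust(p_raw, method = "holm")        # uniformly more powerful than Bonferroni
p.adjust(p_raw, method = "BH")          # controls the false discovery rate instead
```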

Analyzing multiple study outcomes separately
Major problem 2: Different outcome measures often have a strong (and complex) dependence structure.
Solution: NONE. The only way to solve this problem is to quantify/model the dependence explicitly, but this requires sufficient sample size.
Effect of ignoring this: it can drastically kill your power.
Effect of ignoring this: it can substantially alter the quality (accuracy and/or precision) of your point estimates.
In practice, we usually need to accept this limitation and remember to always interpret the outcome of a meta-analysis (or any small sample size analysis) tentatively.

Publication bias
Publication bias occurs when your sample of studies is not actually representative of the research that has been done (or that could hypothetically be done). Put another way, publication bias occurs when the missing studies are systematically different from the located studies.

Publication bias
Types of publication bias:
- Significance bias, or the File Drawer Problem (studies that find statistical significance of an effect are more likely to be published/reported)
- Effect size bias (studies that find large effects are more likely to be published)
- Academic journal bias (only searching for and/or including studies that appear in peer-reviewed journals)
- Language bias (studies written in English are more likely to be included)
- Availability bias (it is easier to locate studies that are more readily available/visible)
- Familiarity bias (including studies only from one's own subdiscipline, favourite methodology, etc.)
- Duplication bias (studies with impressive or even just statistically significant results are more likely to be published more than once)
- Citation bias (higher citation rates make studies easier to find)

Significance bias
It has been well established that studies finding statistically significant results are far more likely to be published (and at a greater rate) than studies that do not find statistical significance (see Dickersin 2005). This drives p-hacking, ad hoc hypothesis selection, and other abuses of statistics, muddying the literature and our understanding of our fields. Some measures are being taken in some disciplines to combat this problem, but it is not going away.

Significance bias
[Figure only on this slide; not reproduced in the transcript.]

Effect size bias
It has also been well established that it is easier to publish results with larger effect sizes. This phenomenon is particularly pronounced in studies with small sample sizes. Why?
Answer: Type M error (Gelman & Carlin 2014). Mathematically, if your study is underpowered and you nonetheless find a statistically significant result, then the corresponding effect size estimate must be exaggerated.

Effect size bias
This is a graphical representation of a t-test comparison of means. The statistical power here is 6%. In this example, the true effect size (marked by the blue line) is very small. The red regions represent values of the test statistic that would be declared significant (and hence the corresponding p-values).

Effect size bias
But then finding a significant result would mean:
- the estimated effect size is at least 9 times too big (Type M error)!
- the estimated effect size has the wrong sign about 25% of the time (Type S error)!
[See Gelman & Carlin (2014) for more info.]
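A minimal simulation sketch of Type M and Type S error, in the spirit of Gelman & Carlin (2014). The true effect and standard error below are assumed values chosen to give roughly 6% power, as in the example above; they are not taken from the slide's figure itself.

```r
# Minimal simulation sketch of Type M (exaggeration) and Type S (sign) errors
# for an underpowered two-sided z-test. Inputs are illustrative assumptions.
set.seed(594)
true_effect <- 0.1    # small true effect
se          <- 0.35   # large standard error, giving roughly 6% power

est <- rnorm(1e5, mean = true_effect, sd = se)   # simulated effect estimates
sig <- abs(est / se) > qnorm(0.975)              # which replications are "significant"?

mean(sig)                                  # approximate power
mean(abs(est[sig])) / true_effect          # Type M: average exaggeration factor
mean(sign(est[sig]) != sign(true_effect))  # Type S: probability of wrong sign, given significance
```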

Effect size bias
In low-powered studies:
- Searching for significance will actually hurt your estimates and inferences.
- Significant results produce estimates that are wildly inaccurate.
- Seemingly small things like measurement error, sampling variability, or minor experimental imperfections become magnified.
- Results are often entirely driven by statistical noise.

Effect size bias
Upshot for meta-analysis: downweight studies with low power!
Recall: a fixed-effect model will naturally accomplish this...
... but a random/mixed-effects model does not! Study weights are much more similar in these models when using traditional inverse-variance weighting.
What to do? Use a different weighting! Recall: our derived formulas for the combined average effect size, CIs, PIs, etc. all allowed for generic weights. So manually downweight low-powered studies in random/mixed-effects meta-analyses (see the sketch below).
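A minimal sketch of manual downweighting: metafor's rma() accepts a user-supplied weights argument. The specific rule used here (inverse-variance weight multiplied by each study's approximate power against an assumed target effect) is only one illustrative choice, not a rule prescribed by the slides; the data are hypothetical.

```r
# Minimal sketch: random-effects meta-analysis with custom weights that
# downweight low-powered studies. Weighting rule and data are illustrative.
library(metafor)

yi <- c(0.60, 0.15, 0.22, 0.55)   # hypothetical standardized mean differences
vi <- c(0.20, 0.02, 0.03, 0.15)   # hypothetical sampling variances

# Approximate two-sided power of each study against an assumed target effect.
target  <- 0.25
power_i <- pnorm(target / sqrt(vi) - qnorm(0.975)) +
           pnorm(-target / sqrt(vi) - qnorm(0.975))

res_default <- rma(yi, vi, method = "REML")                          # usual inverse-variance weights
res_custom  <- rma(yi, vi, method = "REML", weights = power_i / vi)  # downweighted low-power studies
summary(res_default); summary(res_custom)
```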

Correcting for publication bias
Many ways have been proposed to formally check and correct for publication bias in a meta-analysis. The most common are:
- Fail-safe N statistics
- Funnel plots
- Trim and fill procedures
- Cumulative meta-analysis
All of these procedures are based on certain model assumptions, so no one procedure is automatically best.

Fail-safe N statistics
Fail-safe N statistics all exploit the following idea: if we recovered all missing studies and included them in the meta-analysis, the p-value for the summary effect would likely be greater, i.e. less significant.
Note: this assumes that including the missing studies would drag significance down. This is a reasonable assumption under some forms of publication bias, but not necessarily others (e.g. academic journal bias).
Assuming this, though, how many missing studies would we need to find in order to obtain a statistically non-significant summary effect? This number (and related numbers) are called fail-safe N statistics.

Fail-safe N statistics
The oldest and worst fail-safe N statistic is due to Rosenthal: this statistic gives the number of missing (null-effect) studies we would need to include in a meta-analysis to render the summary effect statistically non-significant. There are several problems with this:
(1) It focuses on statistical significance rather than clinical significance.
(2) The math assumes that the effect sizes in the missing studies are all zero; in reality, many of these effect sizes are likely negative, and many may be positive.
(3) The math combines p-values, rather than combining effect sizes and then calculating a new p-value: never do this. In Rosenthal's time, many didn't yet realize that synthesizing p-values is statistically incoherent.
Do not ever use Rosenthal's fail-safe N statistic in practice.

Fail-safe N statistics
Orwin's statistic fixes these problems: it gives the number of missing studies we would need to include in a meta-analysis to obtain a targeted summary effect size.
Orwin's N is better than Rosenthal's because it:
(1) Removes the focus on statistical significance by allowing you to specify what targeted effect size you would deem of clinical interest.
(2) Allows you to specify what the average effect size in the missing studies should be (specifying a normally distributed random effect).
(3) Synthesizes effect sizes rather than p-values.

Fail-safe N statistics
Orwin's N will usually be much lower than Rosenthal's N, and is more useful. But it has problems:
(1) It assumes effect sizes are combined with equal weights.
(2) It assumes a simplistic mechanism of publication bias.
In R (metafor), the default targeted effect size is set to half of the observed summary effect size, but this can be manually overridden.
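A minimal sketch of Orwin's fail-safe N via metafor's fsn(), with hypothetical effect sizes and an assumed target effect of 0.10 (the smallest effect still deemed of practical interest).

```r
# Minimal sketch: Orwin's fail-safe N with metafor's fsn(). Data are hypothetical.
library(metafor)

yi <- c(0.41, 0.28, 0.55, 0.33, 0.47)   # hypothetical observed effect sizes
vi <- c(0.04, 0.03, 0.06, 0.02, 0.05)   # hypothetical sampling variances

fsn(yi, vi, type = "Orwin", target = 0.10)
# If target is omitted, the default is half of the observed summary effect (as noted above).
```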

Fail-safe N statistics
A final alternative is Rosenberg's fail-safe N: it gives the number of missing studies we would need to include in a meta-analysis to obtain a targeted p-value (default 0.05). Rosenberg assumes a fixed-effects model for the missing studies. Rosenberg's N is usually between Orwin's and Rosenthal's N.
My advice: you can consider Orwin's N, but do not take it too seriously. Ignore the others.

Funnel plots
A useful visual tool for diagnosing publication bias is a funnel plot:
- Plot each study's outcome effect size against its standard error.
- If the scatter of points is a symmetric blob around the summary effect size, then there is no evidence of significance bias.
- If the scatter of points trails off to the right, then we have evidence of significance bias.
Note: it is typical to draw a triangle (funnel) around the scatterplot of points; the triangle is centred at the summary effect and has a vertex angle defined by the 95% CI of the summary effect.
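A minimal sketch of drawing a funnel plot with metafor's funnel(), using hypothetical effect sizes and variances.

```r
# Minimal sketch: fit a random-effects model to hypothetical effect sizes and
# draw the funnel plot (effect size vs. standard error, with the funnel region).
library(metafor)

yi <- c(0.62, 0.38, 0.51, 0.15, 0.70, 0.44, 0.29, 0.58)   # hypothetical effects
vi <- c(0.12, 0.05, 0.08, 0.01, 0.15, 0.06, 0.02, 0.10)   # hypothetical variances

res <- rma(yi, vi, method = "REML")
funnel(res)   # asymmetry (points trailing off to one side) suggests possible bias
```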

Funnel plots
[Figure: funnel plot for Zheng et al. (2016); no evidence of publication bias.]

Funnel plots
[Figure: funnel plot for a hypothetical meta-analysis; some evidence of publication bias.]

Funnel plots
Many formal statistical tests have been proposed to try to objectively quantify the amount of publication bias evidenced by a funnel plot...
... but never use these unless you are meta-analyzing many studies. These tests all have low power, even with moderate sample sizes.
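For completeness only, one such test is Egger's regression test for funnel plot asymmetry, available in metafor as regtest(). Per the caveat above, with few studies this test has very low power, so a non-significant result is weak evidence of anything; the data below are the same hypothetical values as in the previous sketch.

```r
# Egger-type regression test for funnel plot asymmetry (shown for completeness only).
library(metafor)

yi  <- c(0.62, 0.38, 0.51, 0.15, 0.70, 0.44, 0.29, 0.58)   # hypothetical effects
vi  <- c(0.12, 0.05, 0.08, 0.01, 0.15, 0.06, 0.02, 0.10)   # hypothetical variances

res <- rma(yi, vi, method = "REML")
regtest(res)   # regresses the observed effects on their standard errors
```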

Trim and fill procedures
If we see evidence of publication bias in the funnel plot, then how can we adjust for it? The most common procedure is trim and fill (a code sketch follows below):
- Iteratively remove the study furthest to the lower right (biggest effect times standard error) until no more evidence of publication bias is present;
- Compute the new summary effect;
- To ensure we don't artificially deflate uncertainty, add the removed studies back in, and also add their mirror images on the opposite side of the new summary effect;
- Now we have an unbiased estimate of the summary effect and a semi-reasonable estimate of its uncertainty.
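A minimal sketch of trim and fill with metafor's trimfill(), again on hypothetical data; the point is to compare the adjusted and unadjusted summary effects and check whether substantive conclusions change.

```r
# Minimal sketch: trim and fill via metafor's trimfill(). Data are hypothetical.
library(metafor)

yi <- c(0.62, 0.38, 0.51, 0.15, 0.70, 0.44, 0.29, 0.58)   # hypothetical effects
vi <- c(0.12, 0.05, 0.08, 0.01, 0.15, 0.06, 0.02, 0.10)   # hypothetical variances

res    <- rma(yi, vi, method = "REML")
res_tf <- trimfill(res)   # trims asymmetric studies, then imputes their mirror images
res_tf                    # adjusted summary effect vs. the original in res
funnel(res_tf)            # filled-in (imputed) studies appear as open points
```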

Trim and fill procedures
Trim... [figure not reproduced in the transcript]

Trim and fill procedures
... and fill [figure not reproduced in the transcript]

Trim and fill procedures
Trim and fill is a nice technique, but it comes with some caveats:
- The actual algorithm that does the trimming tends to perform poorly when there are too few studies or too many aberrant studies.
- The fill algorithm relies on imputation to create the missing effect sizes; this comes with a host of other modelling assumptions that we will not be able to test in a meta-analysis.
- The technique does not explicitly consider Type M error.
Use trim and fill to see if your substantive conclusions change; if they do, then you should attempt to find the source of publication bias and adjust for it directly.

Cumulative meta-analysis
We will discuss this next time!

Case study
Case study: Taylor et al. (2019). Take notes on our discussion!