
Supplementary Information for: Using instrumental variables to disentangle treatment and placebo effects in blinded and unblinded randomized clinical trials influenced by unmeasured confounders

by Elias Chaibub Neto

Supplementary Figures

[Figure: four empirical power panels. Panel titles: (a) "placebo, conf., $\beta = 0$, $\psi \neq 0$"; (b) "placebo, conf., $\beta \neq 0$, $\psi \neq 0$"; (c) "placebo, unconf., $\beta = 0$, $\psi \neq 0$"; (d) "placebo, unconf., $\beta \neq 0$, $\psi \neq 0$". Legend: IV: blinded; IV: unblinded; Regr: blinded; Regr: unblinded.]

Supplementary Figure 1. Empirical power to detect placebo effects in the blinded and unblinded settings. Panels a and b show the results in the presence of confounders, whereas panels c and d show the results in their absence. The regression approaches (brown and dark-orange) were considerably better powered than the IV approaches (blue and red) in the presence of confounders (panels a and b), but only slightly better powered in the absence of confounders (panels c and d). Both regression and IV approaches showed similar power under the blinded and unblinded settings.

[Figure: four empirical power panels. Panel titles: (a) "treat., conf., blinded, $\beta \neq 0$, $\psi = 0$"; (b) "treat., conf., blinded, $\beta \neq 0$, $\psi \neq 0$"; (c) "treat., unconf., blinded, $\beta \neq 0$, $\psi = 0$"; (d) "treat., unconf., blinded, $\beta \neq 0$, $\psi \neq 0$". Legend: IV: no placebo adjustment; IV: adjusted by est. placebo.]

Supplementary Figure 2. Empirical power for detecting treatment effects in the blinded setting. The regression approach (brown) tends to be better powered than the IV approaches in the presence of confounders (panels a and b), but only slightly better in the absence of confounding (panels c and d). The two-step IV approach (blue) tends to be better powered than the non-adjusted one (red) in the presence of placebo effects (panels b and d), but both IV approaches tend to be comparable in absence of placebo effects (panels a and c).

[Figure: four empirical power panels. Panel titles: (a) "treat., conf., unblinded, $\beta \neq 0$, $\psi = 0$"; (b) "treat., conf., unblinded, $\beta \neq 0$, $\psi \neq 0$"; (c) "treat., unconf., unblinded, $\beta \neq 0$, $\psi = 0$"; (d) "treat., unconf., unblinded, $\beta \neq 0$, $\psi \neq 0$". Legend: IV: no placebo adjustment; IV: adjusted by est. placebo; IV: adjusted by true placebo.]

Supplementary Figure 3. Empirical power for detecting treatment effects in the unblinded setting. The regression approach (brown) tended to be better powered than the IV approaches in the presence of confounders (panels a and b), but comparable in the absence of confounding (panels c and d). The two-step IV approach (blue) tended to be slightly better powered than the non-adjusted one (red) in the presence of placebo effect (panel b), but comparable in the other panels.

[Figure: two density panels, (a) blinded and (b) unblinded, with x-axis $\beta - \hat{\beta}$. Legend entries recovered: IV: adjust. (est.) in both panels; IV: adjust. (true) in the unblinded panel.]

Supplementary Figure 4. Comparison of the bias of the regression and IV estimators. Panels a and b show the densities of the difference between true and estimated treatment effects, $\beta - \hat{\beta}$, in the blinded and unblinded settings, respectively. In both settings we observed larger bias in the regression estimates, in comparison to the IV approaches, as illustrated by the heavier tails of the brown densities.

[Figure: panels a ("placebo effect") and b ("treatment effect") show empirical power curves. Strata in panel a: cor(Q, M) < 0.1, then bins of width 0.1 up to cor(Q, M) >= 0.4. Strata in panel b: cor(Z, X) < 0.2, then bins of width 0.1 up to cor(Z, X) >= 0.8. Panels c and d show densities of cor(Q, M) and cor(Z, X), respectively (x-axis: correlation; y-axis: density).]

Supplementary Figure 5. Empirical power curves stratified by strength of association with the IV variable. Panel a shows the power curves for the placebo effect IV estimator $\hat{\psi}$, stratified according to the correlation between Q and M (panel c shows the distribution of the correlation between Q and M across all simulations used to construct the power curves in panel a). Panel b shows the power curves for the two-step treatment effect IV estimator $\hat{\beta}$, stratified according to the correlation between Z and X (panel d shows the distribution of the correlation between Z and X over the simulations used to estimate the power curves in panel b). Results are based on blinded and unblinded simulations influenced by confounders.

[Figure: two empirical power panels, both titled "placebo, confounded, unblinded, $\beta \neq 0$, $\psi \neq 0$". Legend: without adjust. for meas. confounders; with adjust. for meas. confounders.]

Supplementary Figure 6. Empirical power comparison of the IV estimator of placebo effects, with and without adjustment for measured confounders. In order to illustrate how the adjustment for measured confounders can improve the power to detect a causal effect, we show how the use of the treatment variable as a measured confounder of the placebo effect can improve the power to detect placebo effects. Note that in unblinded trials, X corresponds to a measured confounder of the placebo effect $\psi$, since it influences both M (via E) and Y, as illustrated in Figure 1 in the main text. Panel a shows the comparison for data simulated under the unblinded and confounded setting in the presence of treatment and placebo effects, as described in the main text. In this case, adjustment for X seems to provide a marginal improvement in the power to detect placebo effects. Panel b shows the comparison for data simulated under the same specifications as in panel a, but using stronger effects of X on Y and of X on M (via stronger effects of X on E, and of E on M). In this situation, adjustment for X does improve the power to detect placebo effects by a considerably larger margin.

Supplementary Note

Performance evaluation when the emotion level variable is influenced by measurement error

In the following additional simulation studies, involving measurement error, we focus only on the unblinded/unconfounded and unblinded/confounded simulation settings in the presence of both placebo and treatment effects (i.e., $\beta \neq 0$ and $\psi \neq 0$). For each simulated data set we first generate data for the $(Z, X, Q, M, Y)$ variables (in exactly the same manner as the respective simulations presented in the main text), and then generate a new emotion level variable, $\tilde{M}$, by introducing measurement error (ME) on the original emotion level variable, $M$, according to the linear model $\tilde{M} = M + \epsilon_{ME}$, where $\epsilon_{ME} \sim N(0, \sigma^2_{ME})$. (Note that we still use the perfectly measured value, $M$, in the generation of $Y$, but run our analysis using the $(Z, X, Q, \tilde{M}, Y)$ data.) We consider three distinct levels of measurement error: the moderate ME setting, where $\sigma^2_{ME} = 5$; the high ME setting, where $\sigma^2_{ME} = 25$; and the extreme ME setting, where $\sigma^2_{ME} = 100$. Figures 1 and 2 illustrate that these particular choices indeed represent moderate, high, and extreme levels of ME, by comparing the variance of the emotion levels in the original data (no ME) to the variance of the emotion levels generated under the influence of these amounts of ME. In total, we considered 6 additional simulation experiments, each one encompassing 10,000 distinct data sets.

Panels a, b, and c of Figure 3 present, respectively, the empirical bias, $\psi - \hat{\psi}$, of the placebo effect estimator for moderate, high, and extreme levels of ME, for data simulated in the absence of confounders. The panels clearly show that the IV estimator (blue) produces considerably less biased placebo effect estimates than the regression estimator (brown).
This result is not surprising given that the IV estimator is able to handle measurement error in explanatory variables (as a matter of fact, it has been pointed out [1] that the original motivation for the development of IV methods in the econometrics field was to account for measurement error in the explanatory variables, and that only later were IVs used to account for unmeasured confounders). To see how the placebo effect IV estimator is able to account for ME in the explanatory variable $M$, observe that (at least for reasonably large sample sizes),

$$\hat{\psi}_{IV} = \frac{\widehat{\mathrm{Cov}}(Q, Y)}{\widehat{\mathrm{Cov}}(Q, M)} \approx \frac{\widehat{\mathrm{Cov}}(Q, Y)}{\widehat{\mathrm{Cov}}(Q, \tilde{M})}, \tag{1}$$

since $\widehat{\mathrm{Cov}}(Q, M)$ and $\widehat{\mathrm{Cov}}(Q, \tilde{M})$ are consistent estimators of $\mathrm{Cov}(Q, M)$ and $\mathrm{Cov}(Q, \tilde{M})$ and,

$$\mathrm{Cov}(Q, \tilde{M}) = \mathrm{Cov}(Q, M + \epsilon_{ME}) = \mathrm{Cov}(Q, M) + \mathrm{Cov}(Q, \epsilon_{ME}) = \mathrm{Cov}(Q, M), \tag{2}$$

where the last equality follows from the fact that $\mathrm{Cov}(Q, \epsilon_{ME}) = 0$ (since $Q$ is randomized), so that $\widehat{\mathrm{Cov}}(Q, M) \approx \widehat{\mathrm{Cov}}(Q, \tilde{M})$ in reasonably large sample sizes.

Panels d, e, and f of Figure 3 present the distributions of the placebo effect estimates generated by the IV (blue) and regression (brown) estimators along with the distributions of the true placebo effect values (black) used in the simulation of the synthetic data. These panels clarify that the peculiar shape of the brown densities in the top panels is explained by the fact that the placebo effect estimates generated by regression become more concentrated around zero as the amount of measurement error increases, so that the bias distribution ends up closely approximating the distribution of $\psi$ (compare the brown density in panel c to the black density in panel d of Figure 3). On the other hand, the distributions of the estimates generated by the IV estimator tend to be much closer to the distribution of the true placebo effects (note the similarity between the blue and black densities). Figure 3 also shows that the amount of bias tends to increase with increasing amounts of ME, as one would expect.
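The contrast between the two placebo effect estimators under measurement error can be reproduced with a minimal sketch. Everything below is an illustrative assumption rather than the simulation design of the main text: the linear data-generating model, its coefficients, the sample size, and the choice of a simple least-squares slope of $Y$ on the observed emotion level as the regression estimator. Only the ME model $\tilde{M} = M + \epsilon_{ME}$ and the IV ratio of covariances follow the note.

```python
import numpy as np

def cov(a, b):
    """Sample covariance between two 1-D arrays."""
    return np.cov(a, b)[0, 1]

def simulate(n, psi, sigma2_me, rng):
    # Hypothetical unconfounded linear model (an assumption for illustration).
    q = rng.normal(size=n)                # randomized instrument Q
    m = 0.5 * q + rng.normal(size=n)      # true emotion level M
    y = psi * m + rng.normal(size=n)      # outcome is generated from the true M
    # Observed emotion level: M plus N(0, sigma2_me) measurement error.
    m_obs = m + rng.normal(scale=np.sqrt(sigma2_me), size=n)
    return q, m_obs, y

def psi_iv(q, m_obs, y):
    # IV estimator: Cov(Q, Y) / Cov(Q, M_obs).
    return cov(q, y) / cov(q, m_obs)

def psi_regression(m_obs, y):
    # Least-squares slope of Y on M_obs (assumed form of the regression estimator).
    return cov(m_obs, y) / np.var(m_obs, ddof=1)

rng = np.random.default_rng(0)
psi_true = 1.0
results = {}
for sigma2_me in (5.0, 25.0, 100.0):
    q, m_obs, y = simulate(200_000, psi_true, sigma2_me, rng)
    results[sigma2_me] = (psi_iv(q, m_obs, y), psi_regression(m_obs, y))
    print(sigma2_me, results[sigma2_me])
```

Under these assumptions the regression slope shrinks toward zero as $\sigma^2_{ME}$ grows, while the IV estimate stays near the true value, although it becomes noisier.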
Figure 4 presents the results for data simulated in the presence of confounders. We observe essentially the same patterns, except that the results are not as clear cut as in the unconfounded case (note the different scales in the y-axis in comparison to Figure 3). This observation is not surprising given that the presence of confounding makes statistical inferences more challenging.

Figure 5 presents the power curves for the placebo effect estimators for both the unconfounded (panel a) and confounded (panel b) cases. Panel a shows that in the absence of confounding, the IV estimators tended to show high empirical power independent of the amount of ME used in the simulations, whereas the power of the regression approach tended to decrease with increasing amounts of ME. Panel b, on the other hand, shows that the IV estimator tended to be less powered than the regression one in the presence of confounding. We point out, however, that the high power achieved by the regression estimator in the presence of confounding seems to be an artifact of the highly biased estimates produced by the regression approach, as illustrated in the top panels of Figure 4.

It is interesting to note that the randomization test based on the placebo effect IV estimator is robust to measurement error, as illustrated by the fact that the power curves for the distinct amounts of ME (full, dashed, and dotted blue lines in Figure 5) lie on top of each other. To see why this is the case, recall that in the generation of the randomization null we only shuffle the response data, $Y$, relative to the instrument and emotion level data, $(Q, M)$, whose association is kept intact. Hence, while the numerator of the IV estimator, $\widehat{\mathrm{Cov}}(Q, Y)$, changes with distinct data shufflings, the denominator, $\widehat{\mathrm{Cov}}(Q, M)$, is constant across all shufflings used to generate the randomization null. Therefore, it follows that even though the randomization null distributions based on the statistics,

$$\frac{\widehat{\mathrm{Cov}}(Q, Y)}{\widehat{\mathrm{Cov}}(Q, M)}, \qquad \frac{\widehat{\mathrm{Cov}}(Q, Y)}{\widehat{\mathrm{Cov}}(Q, \tilde{M})}, \tag{3}$$

have different spreads (since the distinct but constant denominators scale the identical numerators differently), they still represent a simple re-scaling of each other, and generate the exact same p-values (as long as we use the same permutations of the response data in the generation of the randomization null distributions). Hence, the randomization test generates the same conclusion for an analysis based on perfectly measured emotion levels as for an analysis based on emotion data affected by an arbitrary amount of measurement error. Figure 6 shows an illustrative example.

Figure 7 presents the distributions of the treatment effect bias, $\beta - \hat{\beta}$ (top panels), and treatment effect estimates (bottom panels) for data simulated under the influence of moderate, high, and extreme levels of ME, in the absence of confounding. In addition to the regression estimator, $\hat{\beta}_R$ (brown), and the placebo adjusted IV estimator, $\hat{\beta}_{2sIV} = \widehat{\mathrm{Cov}}(Z, \hat{R}) / \widehat{\mathrm{Cov}}(Z, X)$ (blue), the figure also reports results for the unadjusted IV estimator, $\hat{\beta}_{IV} = \widehat{\mathrm{Cov}}(Z, Y) / \widehat{\mathrm{Cov}}(Z, X)$ (red). The simulation results show that the $\hat{\beta}_{2sIV}$ estimator tended to outperform $\hat{\beta}_{IV}$ and $\hat{\beta}_R$ in the presence of measurement error, although the decreases in bias achieved by $\hat{\beta}_{2sIV}$ tended to be less pronounced than the decreases observed for the placebo effect estimator (note the different scales in the y-axis of Figure 7 in comparison to Figure 3).
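The three treatment effect estimators compared above can be sketched on simulated data. The unblinded, confounded linear model below, its coefficients, and the sample size are hypothetical choices made only for illustration, and the regression estimator is assumed here to be the coefficient of $X$ in a least-squares fit of $Y$ on $X$ and $M$; the main text may define these differently. The two IV estimators follow the covariance ratios given above, with residuals $\hat{R} = Y - \hat{\psi}_{IV} M$. No measurement error is added, to keep the sketch short.

```python
import numpy as np

def cov(a, b):
    """Sample covariance between two 1-D arrays."""
    return np.cov(a, b)[0, 1]

rng = np.random.default_rng(1)
n, beta, psi = 200_000, 1.0, 1.0

# Hypothetical unblinded, confounded linear model (illustrative assumption).
u = rng.normal(size=n)                            # unmeasured confounder
z = rng.binomial(1, 0.5, size=n).astype(float)    # randomized assignment Z
q = rng.normal(size=n)                            # randomized instrument Q for M
x = z + 0.5 * u + rng.normal(size=n)              # treatment X
m = 0.5 * q + 0.5 * x + 0.5 * u + rng.normal(size=n)  # M depends on X (unblinded)
y = beta * x + psi * m + u + rng.normal(size=n)   # outcome Y

# Step 1: IV estimate of the placebo effect.
psi_iv = cov(q, y) / cov(q, m)

# Step 2: remove the estimated placebo contribution, then apply IV with Z.
r_hat = y - psi_iv * m
beta_2siv = cov(z, r_hat) / cov(z, x)

# Unadjusted IV estimator: also picks up the effect of X on Y that runs through M.
beta_iv = cov(z, y) / cov(z, x)

# Assumed regression estimator: coefficient of X in an OLS fit of Y on (1, X, M).
design = np.column_stack([np.ones(n), x, m])
beta_reg = np.linalg.lstsq(design, y, rcond=None)[0][1]

print(beta_2siv, beta_iv, beta_reg)
```

In this particular setup the two-step estimate lands near the true $\beta$, the unadjusted IV estimate absorbs the placebo pathway, and the regression estimate is shifted by the unmeasured confounder. Other coefficient choices would change the size of these gaps.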
The smaller reduction in bias for the treatment effect is also not surprising, since the estimation of $\psi$ is never free from noise, and, even though $\hat{\psi}_{IV}$ is able to reduce the additional bias induced by ME, it cannot completely neutralize it. Hence, the placebo effect estimates obtained in the presence of ME, and employed in the computation of the residuals, $\hat{R} = Y - \hat{\psi}_{IV} \tilde{M}$ (used in the estimation of $\hat{\beta}_{2sIV}$), tend to be less effective in removing the influence of the placebo effect on the outcome variable.

Figure 8 presents the respective results in the presence of confounding. It shows that the presence of both confounding and measurement error makes the statistical inference considerably more challenging. Panel a shows that the $\hat{\beta}_{2sIV}$ estimator (blue) produces less biased estimates than $\hat{\beta}_{IV}$ (red) and $\hat{\beta}_R$ (brown) in the presence of moderate amounts of measurement error. Panel b shows that the bias density of the $\hat{\beta}_{2sIV}$ estimator is slightly more peaked around 0, but also has heavier tails than the densities of the $\hat{\beta}_{IV}$ (red) and $\hat{\beta}_R$ (brown) estimators. Panel c, however, clearly shows that the $\hat{\beta}_{2sIV}$ estimator can be more biased than the other estimators in the presence of confounding and extreme amounts of measurement error (note how the blue density puts less mass around 0, and has considerably heavier tails than the red and brown densities). The likely reason is that, in the presence of extreme amounts of ME, the $\hat{\psi}_{IV}$ estimator can still be fairly biased (as illustrated in panel c of Figure 4), so that adjustments based on highly biased estimates of $\psi$ end up harming the estimation of the treatment effect. We point out, nonetheless, that our evaluations included such extreme amounts of ME for illustrative purposes (since we wanted to investigate how much ME $\hat{\beta}_{2sIV}$ would be able to handle before it started performing worse than $\hat{\beta}_{IV}$). Such extreme cases are not very likely to be found in practice. (Recall that the emotion level variable was, on average, 14 times more variable than the original data free from ME in the extreme ME setting.)

Figure 9 presents the power curves for the treatment effect estimators for both the unconfounded (panel a) and confounded (panel b) cases. In both panels the regression and unadjusted IV estimators tended to be higher powered than the placebo adjusted IV estimator (whose power tended to decrease with increasing amounts of ME).

Finally, for the sake of completeness, we also evaluate the type I error rates of the procedures in the presence of ME, using a few additional simulation experiments focusing again on the unblinded/unconfounded and unblinded/confounded settings, but in the absence of placebo and treatment effects (i.e., $\beta = 0$ and $\psi = 0$). Panels a and b of Figure 10 present the results for the placebo effect tests in the unconfounded and confounded cases, respectively, and show that the error rates of the IV approach (blue) are still controlled at the exact nominal levels in the presence of measurement error, whereas the regression approach (brown) still shows inflated error rates in the presence of confounding. Panels c and d present the respective results for the treatment effect tests. Both panels show well controlled error rates for the unadjusted IV approach (red), but slightly inflated error rates for the placebo adjusted IV approach (blue). The regression approach (brown), on the other hand, shows well controlled error rates in the absence of confounding (panel c) but inflated errors in the presence of confounding (panel d).

References

1. Angrist, J. & Krueger, A. Instrumental variables and the search for identification: from supply and demand to natural experiments. Journal of Economic Perspectives 15, 69-85 (2001).

[Figure: histograms. Panels a to d: variance of the emotion level under no ME ($\sigma^2_{ME} = 0$), moderate ME ($\sigma^2_{ME} = 5$), high ME ($\sigma^2_{ME} = 25$), and extreme ME ($\sigma^2_{ME} = 100$). Panels e to g: ratio of variances for moderate ME / no ME, high ME / no ME, and extreme ME / no ME.]

Figure 1. Comparison of the emotion level variances under varying levels of measurement error in the absence of confounding. Panel a shows the distribution of estimates of the variance of the emotion level across 10,000 synthetic data sets simulated (without measurement error) under the unblinded/unconfounded setting, in the presence of treatment and placebo effects. Panels b, c, and d show the distributions of the variance estimates after the addition of moderate, high and extreme amounts of measurement error to the original data, respectively. Panel e shows the distribution of the ratio of the variances for data generated under moderate ME relative to the original data. Note that the mean is around 2.4, meaning that, on average, the data generated with ME was 2.4 times more variable than the original data, showing that the choice $\sigma^2_{ME} = 5$ induces a moderate amount of ME in the data. Panel f shows the respective distribution for data generated under high ME relative to the original data. In this case, the data generated with measurement error was on average 8 times more variable than the original data, showing that the choice $\sigma^2_{ME} = 25$ indeed leads to high ME levels. Panel g shows the distribution of the ratio of the variances for the comparison of extreme ME relative to the original data. On average, the data generated with ME was 29 times more variable than the original data, showing that the choice $\sigma^2_{ME} = 100$ leads to extreme ME levels in the data.

[Figure: histograms. Panels a to d: variance of the emotion level under no ME ($\sigma^2_{ME} = 0$), moderate ME ($\sigma^2_{ME} = 5$), high ME ($\sigma^2_{ME} = 25$), and extreme ME ($\sigma^2_{ME} = 100$). Panels e to g: ratio of variances for moderate ME / no ME, high ME / no ME, and extreme ME / no ME.]

Figure 2. Comparison of the emotion level variances under varying levels of measurement error in the presence of confounding. Panel a shows the distribution of estimates of the variance of the emotion level across 10,000 synthetic data sets simulated (without measurement error) under the unblinded/confounded setting, in the presence of treatment and placebo effects. Panels b, c, and d show the distributions of the variance estimates after the addition of moderate, high and extreme amounts of measurement error to the original data, respectively. Panel e shows the distribution of the ratio of the variances for data generated under moderate ME relative to the original data. Note that the mean is around 1.6, meaning that, on average, the data generated with ME was 1.6 times more variable than the original data, showing that the choice $\sigma^2_{ME} = 5$ induces a moderate amount of ME in the data. Panel f shows the respective distribution for data generated under high ME relative to the original data. In this case, the data generated with measurement error was on average 4 times more variable than the original data, showing that the choice $\sigma^2_{ME} = 25$ indeed leads to high ME levels. Panel g shows the distribution of the ratio of the variances for the comparison of extreme ME relative to the original data. On average, the data generated with ME was 14 times more variable than the original data, showing that the choice $\sigma^2_{ME} = 100$ leads to extreme ME levels in the data.
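The variance ratios reported in Figures 1 and 2 can be checked against the ME model with a little arithmetic. Because the measurement error is independent of the emotion level, the observed variance is the original variance plus $\sigma^2_{ME}$, so the ratio is roughly $1 + \sigma^2_{ME} / \mathrm{Var}(M)$. The sketch below backs out the implied $\mathrm{Var}(M)$ from the reported moderate ME ratio and predicts the other two ratios. This is only a rough consistency check: the function name is hypothetical, and the mean of a ratio need not equal the ratio of means, so small discrepancies from the reported averages are expected.

```python
def implied_ratios(ratio_moderate, sigma2_levels=(5.0, 25.0, 100.0)):
    """Back out Var(M) from the moderate ME ratio and predict all three ratios."""
    # ratio = 1 + sigma2 / Var(M)  =>  Var(M) = sigma2 / (ratio - 1)
    var_m = sigma2_levels[0] / (ratio_moderate - 1.0)
    return var_m, [1.0 + s2 / var_m for s2 in sigma2_levels]

# Reported mean ratios in the moderate ME setting.
for label, ratio in (("unconfounded (Figure 1)", 2.4), ("confounded (Figure 2)", 1.6)):
    var_m, ratios = implied_ratios(ratio)
    print(label, round(var_m, 2), [round(r, 1) for r in ratios])
```

The predicted high and extreme ratios come out at about 8 and 29 for the unconfounded setting, in line with Figure 1, and about 4 and 13 for the confounded setting, close to the 4 and 14 reported in Figure 2.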

[Figure: six density panels. Panels a to c: placebo effect bias, $\psi - \hat{\psi}$, titled "unconfounded, moderate ME", "unconfounded, high ME", and "unconfounded, extreme ME". Panels d to f: placebo effect estimates, $\hat{\psi}$, for the same three settings, with the true placebo effect overlaid.]

Figure 3. Empirical bias of the placebo effect tests under the influence of measurement error and in the absence of confounding. Panels a, b, and c show density estimates of the placebo effect bias, $\psi - \hat{\psi}$, for the IV (blue) and regression (brown) estimators, for the data sets simulated under the influence of moderate, high, and extreme amounts of measurement error of the emotion level variable, but in the absence of confounding. The panels clearly show that the IV estimator (blue) produces considerably less biased placebo effect estimates than the regression estimator (brown), and that the amount of bias tends to increase with increasing amounts of measurement error (as one would expect). Panels d, e, and f present the distributions of the placebo effects estimates generated by the IV (blue) and regression (brown) estimators along with the distributions of the true placebo effects (black) used to simulate the synthetic data. These panels show that the placebo effect estimates generated by the regression approach tend to get more concentrated around zero, as the amount of measurement error increases, whereas the estimates generated by the IV estimator tend to be much closer to the distribution of the true placebo effects (note the similarity between the blue and black densities).

[Figure: six density panels. Panels a to c: placebo effect bias, $\psi - \hat{\psi}$, titled "confounded, moderate ME", "confounded, high ME", and "confounded, extreme ME". Panels d to f: placebo effect estimates, $\hat{\psi}$, for the same three settings, with the true placebo effect overlaid.]

Figure 4. Empirical bias of the placebo effect tests under the influence of measurement error and in the presence of confounding. Panels a, b, and c show density estimates of the placebo effect bias, $\psi - \hat{\psi}$, for the IV (blue) and regression (brown) estimators, for the data sets simulated under the influence of moderate, high, and extreme amounts of measurement error of the emotion level variable, but in the presence of confounding. The panels clearly show that the IV estimator (blue) produces considerably less biased placebo effect estimates than the regression estimator (brown), and that the amount of bias tends to increase with increasing amounts of measurement error (as one would expect). Panels d, e, and f present the distributions of the placebo effects estimates generated by the IV (blue) and regression (brown) estimators along with the distributions of the true placebo effects (black) used to simulate the synthetic data. These panels show that the placebo effect estimates generated by the regression approach tend to get more concentrated around zero, as the amount of measurement error increases, whereas the estimates generated by the IV estimator tend to be closer to the distribution of the true placebo effects.

[Figure 5 panels: (a) placebo, unconfounded; (b) placebo, confounded. Curves: IV and regression (Regr), each under moderate, high, and extreme ME.]

Figure 5. Empirical power of the placebo effect tests under the influence of measurement error. Panel a shows that, in the absence of confounding, the IV estimator tended to show high power (independent of the amount of ME used in the simulations), whereas the power of the regression approach tended to decrease with increasing amounts of ME. Panel b, on the other hand, shows that the regression estimator tended to be higher powered than the IV estimator in the presence of confounding. It also shows that the power of the regression approach tended to decrease with increasing amounts of ME, whereas the power curves of the IV approach were the same, independent of the amount of measurement error.

[Figure 6 panels: (a) no ME, (b) moderate ME, (c) high ME, (d) extreme ME; x-axis: randomization null; y-axis: density.]

Figure 6. The randomization test for the placebo effect is robust to measurement error in the emotion level variable. Panels a, b, c, and d present the randomization null distributions for the IV estimator of the placebo effect under the no ME, moderate ME, high ME, and extreme ME settings, respectively. The 4 null distributions were generated using the same random permutations of the response data. In this particular example, the true placebo effect used to simulate the data was ψ = 1, and the respective estimated effects were 1.698, 2.099, 3.737, and 12.527. The respective estimated covariances between the instrument and the emotion levels were 0.165, 0.134, 0.075, and 0.022. In spite of the increasing spreads of the null distributions (due to the decreasing covariances between the instrument and the emotion level), in all 4 cases exactly 95 out of the 100,000 permutations of the response data led to statistics equal to or larger than the respective estimates in the original data (shown by the red lines). In all 4 cases we also observed that exactly 105 permutations of the response data led to statistics equal to or less than the negative of the observed estimates. Therefore, the two-tailed p-values derived from the 4 randomization tests are identical and equal to (95 + 105)/100,000 = 0.002.
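The invariance reported for Figure 6 can be illustrated with a short sketch. It assumes the IV statistic is the ratio cov(Z, Y) / cov(Z, M_obs), where Z is the instrument and M_obs the measured emotion level; this is consistent with the caption's remark about the instrument and emotion level covariance, but it is an assumption here. The toy data model, the ME standard deviations, and the variable names are likewise illustrative. Because permuting the response changes only the numerator, the denominator rescales the observed statistic and every permuted statistic by the same constant, so the tail counts do not depend on the amount of measurement error.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)

rng = random.Random(1)
n, psi, n_perm = 200, 1.0, 2000

z = [float(i % 2) for i in range(n)]               # balanced binary instrument (assumed)
m = [zi + rng.gauss(0.0, 1.0) for zi in z]         # true emotion level
y = [psi * mi + rng.gauss(0.0, 1.0) for mi in m]   # response

# The same true M observed under four illustrative ME standard deviations.
me_levels = [("no ME", 0.0), ("moderate ME", 0.5), ("high ME", 1.0), ("extreme ME", 2.0)]
m_obs = {label: [mi + rng.gauss(0.0, sd) for mi in m] for label, sd in me_levels}

# One shared set of permutations of the response, reused for every ME level.
null_numerators = [cov(z, rng.sample(y, n)) for _ in range(n_perm)]

p_values = {}
for label, _ in me_levels:
    denom = cov(z, m_obs[label])                   # unaffected by permuting Y
    observed = cov(z, y) / denom
    null = [num / denom for num in null_numerators]
    upper = sum(s >= abs(observed) for s in null)
    lower = sum(s <= -abs(observed) for s in null)
    p_values[label] = (upper + lower) / n_perm
    print(f"{label:12s} estimate: {observed:7.3f}  two-tailed p: {p_values[label]:.4f}")
```

The estimates and the spread of the null distribution change from one ME level to the next, but the two-tailed p-value is the same in every row.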

[Figure 7 panels: (a)-(c) bias densities, x-axis $\beta - \hat{\beta}$, legend: IV with placebo adjustment, IV without placebo adjustment; (d)-(f) estimate densities, x-axis $\hat{\beta}$, legend: true treatment effect, IV (with placebo adjustment), IV (without placebo adjustment). Columns: unconfounded with moderate, high, and extreme ME.]

Figure 7. Empirical bias of the treatment effect tests under the influence of measurement error and in the absence of confounding. Panels a, b, and c show density estimates of the treatment effect bias, $\beta - \hat{\beta}$, for the regression (brown), placebo adjusted IV (blue), and unadjusted IV (red) estimators, for the data sets simulated under moderate, high, and extreme amounts of measurement error in the emotion level variable, but in the absence of confounding. The panels show that the placebo adjusted IV estimator (blue) produces considerably less biased estimates than the regression (brown) and unadjusted IV (red) estimators, and that the amount of bias for the regression and placebo adjusted estimators tended to increase with increasing amounts of measurement error (note, however, that the bias is constant for the unadjusted IV estimator since it does not depend on the emotion level, M, and, hence, is not influenced by measurement error on M). Panels d, e, and f present the distributions of the treatment effect estimates generated by the regression (brown), placebo adjusted IV (blue), and unadjusted IV (red) estimators along with the distribution of the true treatment effects (black) used to simulate the synthetic data.

Note that the regression estimates (brown density) tend to approximate the unadjusted IV estimates (red density) as the amount of ME increases. The likely reason is that the regression estimator tends to automatically downplay the contribution of M to the response Y in the presence of extreme ME. In other words, $\hat{\psi}_R$ tends to be close to 0 (panel f of Figure 3) during the estimation of the parameters of the linear model $Y = \mu_Y + \beta X + \psi M + \epsilon_Y$, so that the treatment effect estimate produced by the regression of Y on both X and M tends to be similar to the estimate produced by regressing Y on X alone (whose value tends to be similar to the estimate generated by the unadjusted IV estimator, in the absence of confounding).
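The argument about the regression estimator downplaying M can be checked numerically. The sketch below fits the linear model $Y = \mu_Y + \beta X + \psi M + \epsilon_Y$ by least squares with a noisy version of M and compares the resulting coefficient on X with the slope from regressing Y on X alone. How M depends on X (here a shift of 0.5), the noise scales, and the helper names are assumptions made for illustration only; they are not taken from the paper.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)

def fit_y_on_x_and_m(x, m_obs, y):
    """OLS slopes of Y on (X, M_obs), solved from the 2x2 normal equations."""
    sxx, smm, sxm = cov(x, x), cov(m_obs, m_obs), cov(x, m_obs)
    sxy, smy = cov(x, y), cov(m_obs, y)
    det = sxx * smm - sxm ** 2
    beta_hat = (sxy * smm - smy * sxm) / det
    psi_hat = (smy * sxx - sxy * sxm) / det
    return beta_hat, psi_hat

rng = random.Random(3)
n, beta, psi = 20000, 1.0, 1.0
x = [float(i % 2) for i in range(n)]                  # randomized treatment indicator
m = [0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]      # assumed: X shifts the emotion level
y = [beta * xi + psi * mi + rng.gauss(0.0, 1.0) for xi, mi in zip(x, m)]

beta_x_only = cov(x, y) / cov(x, x)                   # regression of Y on X alone

gaps, psi_hats = {}, {}
# Illustrative ME standard deviations, not the paper's settings.
for label, me_sd in [("moderate", 1.0), ("high", 2.0), ("extreme", 4.0)]:
    m_obs = [mi + rng.gauss(0.0, me_sd) for mi in m]
    beta_hat, psi_hat = fit_y_on_x_and_m(x, m_obs, y)
    psi_hats[label] = psi_hat
    gaps[label] = abs(beta_hat - beta_x_only)
    print(f"{label:9s} psi_hat: {psi_hat:.3f}  beta_hat: {beta_hat:.3f}  "
          f"Y-on-X slope: {beta_x_only:.3f}")
```

As the measurement error grows, the fitted coefficient on the noisy M shrinks toward zero and the coefficient on X moves toward the slope from the regression of Y on X alone, which is the behavior the note describes.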

[Figure 8 panels: (a)-(c) bias densities, x-axis $\beta - \hat{\beta}$, legend: IV with placebo adjustment, IV without placebo adjustment; (d)-(f) estimate densities, x-axis $\hat{\beta}$, legend: true treatment effect, IV (with placebo adjustment), IV (without placebo adjustment). Columns: confounded with moderate, high, and extreme ME.]

Figure 8. Empirical bias of the treatment effect tests under the influence of measurement error and in the presence of confounding. Panels a, b, and c show density estimates of the treatment effect bias, $\beta - \hat{\beta}$, for the regression (brown), placebo adjusted IV (blue), and unadjusted IV (red) estimators, for the data sets simulated under moderate, high, and extreme amounts of measurement error in the emotion level variable, in the presence of confounding. Panel a shows that the placebo adjusted IV estimator (blue) produces less biased estimates than the regression (brown) and unadjusted IV (red) estimators in the presence of moderate amounts of measurement error. Panel b shows that the bias density of the placebo adjusted IV estimator is slightly more peaked around 0, but also has heavier tails than the densities of the unadjusted IV (red) and regression (brown) estimators. Panel c, however, shows that the placebo adjusted IV estimator can be more biased than the other estimators in the presence of extreme amounts of measurement error (note how the blue density puts less mass around 0 and has considerably heavier tails than the red and brown densities).

Panels d, e, and f present the distributions of the treatment effect estimates generated by the regression (brown), placebo adjusted IV (blue), and unadjusted IV (red) estimators along with the distribution of the true treatment effects (black) used to simulate the synthetic data. These results illustrate that the presence of both confounding and measurement error makes statistical inference considerably more challenging, especially in the presence of extreme amounts of measurement error.

[Figure 9 panels: (a) treatment, unconfounded; (b) treatment, confounded. Curves: IV adjusted by placebo, IV with no placebo adjustment, and regression, each under moderate, high, and extreme ME.]

Figure 9. Empirical power of the treatment effect tests under the influence of measurement error. Panels a and b report the empirical power for the regression (brown), placebo adjusted IV (blue), and unadjusted IV (red) estimators, for the data sets simulated under moderate, high, and extreme amounts of measurement error in the emotion level variable, in the absence and in the presence of confounding, respectively. In both panels the regression and unadjusted IV estimators tended to be higher powered than the placebo adjusted IV estimator (whose power tended to decrease with increasing amounts of ME).

[Figure 10 panels, all with β = 0, ψ = 0: (a) placebo, unconfounded; (b) placebo, confounded; (c) treatment, unconfounded; (d) treatment, confounded. Axis: empirical type I error. Curves in panels a and b: IV and regression (Regr); curves in panels c and d: IV adjusted by placebo, IV with no placebo adjustment, and regression; each under moderate, high, and extreme ME.]

Figure 10. Empirical type I error rates of the placebo and treatment effect tests under the influence of measurement error. Panels a and b present the results for the placebo effect tests in the unconfounded and confounded cases, respectively, and show that the error rates of the IV approach (blue) are still controlled at the exact nominal levels in the presence of measurement error, whereas the regression approach (brown) still shows inflated error rates in the presence of confounding. Panels c and d present the respective results for the treatment effect tests. Both panels show well controlled error rates for the unadjusted IV approach (red), but slightly inflated error rates for the placebo adjusted IV approach (blue). The regression approach (brown), on the other hand, shows well controlled error rates in the absence of confounding (panel c) but inflated errors in the presence of confounding (panel d). Results based on 6 additional simulation experiments, each one encompassing 10,000 distinct data sets.
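An empirical type I error rate of the kind summarized in panel b of Figure 10 can be estimated by simulating many data sets under the null and counting rejections. The sketch below is a much smaller toy version, not the paper's experiment. Its confounding structure (a shared unobserved variable `u`), the sample sizes, the ME level, and the use of permutation tests for both statistics are assumptions, and this excerpt does not say how the regression test was carried out. The p-value uses the common (hits + 1) / (n_perm + 1) convention.

```python
import random

def centered(a):
    """Subtract the mean from every element."""
    mean = sum(a) / len(a)
    return [v - mean for v in a]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def perm_pvalue(weights, y, n_perm, rng):
    """Two-sided permutation p-value for the statistic sum(weights[i] * y[i]).

    With centered weights this sum is proportional to a covariance with Y. Both
    estimators equal it divided by a constant that does not involve Y, so the
    constant cancels when permuted and observed statistics are compared.
    """
    observed = abs(dot(weights, y))
    hits = sum(abs(dot(weights, rng.sample(y, len(y)))) >= observed
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)

def type_one_error(n_sim, n, n_perm, me_sd, alpha, rng):
    """Rejection rates under psi = 0 with an unobserved confounder of M and Y."""
    rejections = {"IV": 0, "Regr": 0}
    for _ in range(n_sim):
        z = [float(i % 2) for i in range(n)]
        rng.shuffle(z)                                   # randomized instrument
        u = [rng.gauss(0.0, 1.0) for _ in range(n)]      # unobserved confounder
        m = [zi + ui + rng.gauss(0.0, 1.0) for zi, ui in zip(z, u)]
        y = [ui + rng.gauss(0.0, 1.0) for ui in u]       # null: M has no effect on Y
        m_obs = [mi + rng.gauss(0.0, me_sd) for mi in m]
        if perm_pvalue(centered(z), y, n_perm, rng) <= alpha:
            rejections["IV"] += 1
        if perm_pvalue(centered(m_obs), y, n_perm, rng) <= alpha:
            rejections["Regr"] += 1
    return {name: count / n_sim for name, count in rejections.items()}

rates = type_one_error(n_sim=200, n=40, n_perm=99, me_sd=1.0, alpha=0.05,
                       rng=random.Random(2))
print(rates)
```

In this toy setting the IV test rejects at roughly the nominal rate because the instrument is randomized independently of the confounder, while the regression-based test rejects far more often because the noisy emotion level and the response share the confounder.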