

Running head: TESTING VS. ESTIMATION

Why Hypothesis Tests Are Essential for Psychological Science: A Comment on Cumming

Richard D. Morey (1), Jeffrey N. Rouder (2), Josine Verhagen (3), and Eric-Jan Wagenmakers (3)

1 University of Groningen
2 University of Missouri
3 University of Amsterdam

Word count: 896

Address correspondence to: Richard D. Morey, Grote Kruisstraat 2/1, 9712 TS Groningen, The Netherlands. Email: r.d.morey@rug.nl

Author note: This work was supported by the starting grant "Bayes or Bust" awarded by the European Research Council, and by National Science Foundation grants BCS-1240359 and SES-102408.

Abstract

Psychological Science recently announced changes to its publication guidelines (Eich, in press). Among these are many positive changes that will increase the quality of the scientific results published in the journal. One of the changes, emphasized by Cumming (in press), is an increased emphasis on estimation, as opposed to hypothesis testing. We argue that estimation alone is ill-suited to science, in which the testing of predictive models of phenomena is a key goal. Both estimation and hypothesis testing methods are essential.

Why Hypothesis Tests Are Essential for Psychological Science: A Comment on Cumming

In a much-anticipated move, Psychological Science announced important changes to its publication practices (Eich, in press); specifically, the changes promote open science (e.g., open data, open materials, and preregistration) and recommend that researchers report confidence intervals instead of p-values. Inference by confidence interval, it is argued, constitutes a "new statistics" in psychology (Cumming, in press; Grant, 1962) and avoids the pitfalls of null hypothesis significance testing. To embrace the new statistics, Cumming recommends a shift to estimation as "usually the most informative approach," and states that "interpretation [of data] should be based on estimation." We broadly agree with the recommendations for more quantitative thinking, more openness and transparency, and an end to p-values. Nonetheless, the benefits of estimation have been overstated, and the mistaken idea that estimation is superior to hypothesis testing is becoming the conventional wisdom in psychology. This conventional wisdom is perpetuated by the APA Manual, by the new statistical guidelines for journals of the Psychonomic Society, by the Society for Personality and Social Psychology Task Force on Publication and Research Practices, and now also by Psychological Science. However, estimation alone is insufficient to move psychology forward as a science; proper hypothesis testing methods are crucial.

Scientific research is often driven by theories that unify a diverse set of previous observations and make clear predictions. For instance, in physics one might test for the existence of the Higgs boson (Low, Lykken, & Shaughnessy, 2012). In biology, one might compare various phylogenetic hypotheses about how species are related (Huelsenbeck & Ronquist, 2001). In psychology, one might test whether fluid intelligence can be improved by training (Harrison et al., 2013).

Testing hypotheses is not as simple as looking at the data to see whether they agree with the theory. There are three necessary components to testing a theory: first, one must know what one would expect of the data if the theory were true; second, one must know what one would expect if the theory were false; and third, one must have a principled method for using the data to make an inference about the theory. The second and third components are crucial. Inferring support for a theory on the sole basis of agreement between the observations and the theory is a logical fallacy (known as the converse error; Popper, 1959); having no principled method for inferring support leaves one with only ad hoc rules subject to one's own biases.

The difficulties inherent in using estimation to test theory can be illustrated by the example chosen by Cumming (in press, p. 15). Velicer et al. (2008) tested a theoretically motivated model of smokers' readiness to stop smoking. The authors predicted the strength of association between 15 inventory scales and the Stage of Change from the Transtheoretical Model of behavioral change (Prochaska & Velicer, 1997). To assess the model, they determined whether confidence intervals contained their predictions. Eleven of the 15 predictions fell within the 95% confidence intervals, which Velicer et al. state "provid[es] overall support for the theoretical model." There are two issues with this assessment. First, it is arbitrary: would 10 out of 15 also support the theory, or instead refute it? If we instead used 99% CIs, would 12 out of 15 be enough to support the theory? This arbitrariness arises because CIs offer no principled method for generating an inference regarding the theory. Second, no indication is given of what one would expect if the theory were false. If one would expect similar data even when the theory is false, then the observed data cannot be said to support the theory. In Cumming's example, two of the three necessary conditions for testing theory are missing.

We are solidly in agreement with Cumming, however, that any reliance on null hypothesis significance testing (NHST) should be avoided. NHST also fails to consider predictions when the null hypothesis is false, and thus it too cannot provide support for a theory. If traditional hypothesis testing is inappropriate, what should replace it? One possibility that we advocate is Bayesian model comparison (Kass & Raftery, 1995; Rouder, Morey, Speckman, & Province, 2012; Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011), which meets the three necessary conditions for theory testing without suffering from the problems that plague NHST. In Bayesian model comparison, the probability of the observed data is compared across the various models. This comparison is justified by Bayes' theorem, and it yields a principled, continuous measure of relative evidence called the Bayes factor. Non-Bayesian methods for model comparison exist as well (Burnham & Anderson, 2002; Grünwald, 2007), though a discussion of the specifics of the different approaches is outside the scope of this comment.

For psychological science to be a healthy science, both estimation and hypothesis testing are needed. Estimation is necessary in pre-theoretical work, before clear predictions can be made, and in post-theoretical work, for theory revision. But hypothesis testing, not estimation, is necessary for testing the quantitative predictions of theories. Neither hypothesis testing nor estimation is more informative than the other; rather, they answer different questions. Using estimation alone turns science into an exercise in keeping records about the effect sizes in diverse experiments, producing a massive catalog devoid of theoretical content; using hypothesis testing alone may cause researchers to miss rich, meaningful structure in data. For researchers to obtain principled answers to the full range of questions they might ask, it is crucial that estimation and hypothesis testing be advocated side by side.
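For readers unfamiliar with Bayes factors, the idea of comparing the probability of the observed data across models can be made concrete with a toy example. The sketch below is illustrative only and is not drawn from the papers cited above (it is not, for instance, one of the default tests of Rouder et al., 2012): it computes the Bayes factor for a binomial outcome under a point null, H0: theta = 0.5, versus an alternative with a uniform prior, H1: theta ~ Uniform(0, 1), where both marginal likelihoods have simple closed forms.

```python
from math import comb

def bayes_factor_01(k, n):
    """Bayes factor BF01 for k successes in n binomial trials.

    Compares H0: theta = 0.5 against H1: theta ~ Uniform(0, 1).
    Marginal likelihood under H0: C(n, k) * 0.5**n.
    Marginal likelihood under H1: the binomial likelihood integrated
    over the uniform prior, which reduces to 1 / (n + 1).
    BF01 > 1 favors H0; BF01 < 1 favors H1.
    """
    m0 = comb(n, k) * 0.5 ** n  # probability of the data under H0
    m1 = 1.0 / (n + 1)          # marginal probability of the data under H1
    return m0 / m1

# Hypothetical data: 14 successes in 20 trials.
bf01 = bayes_factor_01(14, 20)  # about 0.78: weak evidence against H0
```

Note that the Bayes factor is a continuous measure of relative evidence: 14 out of 20 yields only equivocal evidence, whereas 10 out of 20 yields a BF01 near 3.7, i.e., modest evidence for the point null, an inference NHST cannot deliver.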

References

Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach (2nd ed.). New York, NY: Springer-Verlag.

Cumming, G. (in press). The new statistics: Why and how. Psychological Science.

Eich, E. (in press). Business not as usual. Psychological Science.

Grant, D. A. (1962). Testing the null hypothesis and the strategy and tactics of investigating theoretical models. Psychological Review, 69, 54-61.

Grünwald, P. (2007). The minimum description length principle. Cambridge, MA: MIT Press.

Harrison, T. L., Shipstead, Z., Hicks, K. L., Hambrick, D. Z., Redick, T. S., & Engle, R. W. (2013). Working memory training may increase working memory capacity but not fluid intelligence. Psychological Science, 24(12), 2409-2419.

Huelsenbeck, J. P., & Ronquist, F. (2001). MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics, 17(8), 754-755.

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.

Low, I., Lykken, J., & Shaughnessy, G. (2012). Have we observed the Higgs boson (imposter)? Physical Review D, 86.

Popper, K. (1959). The logic of scientific discovery (2002 e-library ed.). London: Routledge Classics.

Prochaska, J. O., & Velicer, W. F. (1997). The transtheoretical model of health behavior change. American Journal of Health Promotion, 12, 38-48.

Rouder, J. N., Morey, R. D., Speckman, P. L., & Province, J. M. (2012). Default Bayes factors for ANOVA designs. Journal of Mathematical Psychology, 56, 356-374.

Velicer, W. F., Cumming, G., Fava, J. L., Rossi, J. S., Prochaska, J. O., & Johnson, J. (2008). Theory testing using quantitative predictions of effect size. Applied Psychology, 57(4), 589-608.

Wagenmakers, E.-J., Wetzels, R., Borsboom, D., & van der Maas, H. (2011). Why psychologists must change the way they analyze their data: The case of psi: Comment on Bem (2011). Journal of Personality and Social Psychology, 100, 426-432.