Fixing the replicability crisis in science Jelte M. Wicherts 1
- Supporting responsible research practices
- Diminishing biases in research
- Lowering false positive rates
- Fixing the replicability crisis in science
- Enhancing trust in scientific findings
- Making science more efficient
- Improving scientific reproducibility
- Empowering the truth in science
- Scaring away scientific cowboys 2
Empirical cycle: Observe (literature) → Hypothesize → Predict (Set-up exp.) → Test (collect & analyze data) → Evaluate (present) 3
Success rates across the sciences Source: Fanelli, D. (2010). Positive results increase down the hierarchy of the sciences. PLOS ONE, 5(4), e10068. 4
Fraud [in the empirical cycle: Observe (literature) → Hypothesize → Predict (Set-up exp.) → Test (collect & analyze data) → Evaluate (present)] 5
How to counter scientific misconduct
- Improve regulations & procedures
- Training in responsible conduct of research
- Reduce questionable research practices
- Enhance transparency and accountability 6
Open science practices
- Increase reproducibility and data re-use
- Lead to loss of sleep among scientific fraudsters
Sources: Wicherts, J. M. (2011). Psychology must learn a lesson from fraud case. Nature, 480, 7. Wicherts, J. M., & Bakker, M. (2012). Publish (your data) or (let the data) perish! Why not publish your data too? Intelligence, 40, 73-76. 7
HARKing [in the empirical cycle: Observe (literature) → Hypothesize → Predict (Set-up exp.) → Test (collect & analyze data) → Evaluate (present)] 8
HARKing [Figure: "White Noise" time series y(t) plotted over time (0-500), annotated "the infamous one-year dip!" at t = 365] 9
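That dip is exactly what HARKing manufactures from nothing. Below is a minimal Python sketch (my own illustration, not from the talk) that generates pure white noise and then "tests" the most extreme day as if it had been predicted in advance:

```python
# HARKing on pure noise: generate white noise, then "discover" the most
# extreme day and test it post hoc as if it were an a priori hypothesis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(365)
y = rng.standard_normal(500)            # white noise: no real effect anywhere

day = int(np.argmax(np.abs(y)))         # chosen AFTER seeing the data
z = (y[day] - y.mean()) / y.std(ddof=1)
p = 2 * stats.norm.sf(abs(z))           # naive two-sided p, ignoring selection

print(f"'discovered' effect at t = {day}: z = {z:.2f}, p = {p:.4f}")
# The maximum of 500 |z| values is expected to exceed 3 by chance alone,
# so a hypothesis invented after seeing this dip cannot be tested on it.
```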
HARKing 10
Explorative research 11
Suboptimal designs [in the empirical cycle: Observe (literature) → Hypothesize → Predict (Set-up exp.) → Test (collect & analyze data) → Evaluate (present)] 12
Poor design 13
[Figure: sampling distributions under H0 and HA by study number, with actual effect size on the x-axis; per study: 1.85 / N = 128, 2.55 / N = 72, 3.40 / N = 48, 4.25 / N = 28]
Power failure in neuroscience (excerpt, Button et al., 2013; text partly truncated on the original slide):
Our results indicate that the average statistical power of studies in the field of neuroscience is probably no more than between ~8% and ~31%, on the basis of evidence from diverse subfields within neuroscience. If the low average power we observed across these studies is typical of the neuroscience literature as a whole, this has profound implications. A major implication is that the likelihood that a nominally significant finding actually reflects a true effect is small: the probability that a research finding reflects a true effect (PPV) decreases as statistical power decreases, for any given pre-study odds and a fixed type I error level. In one set of analyses, the median statistical power was 8% across 461 individual studies.
Ethical implications: low average power also has ethical implications. In the analysis of animal model studies, the average sample size of 22 animals for the water maze experiments was only sufficient to detect an effect size of d = 1.26.
[Figure 3: Median power of studies included in neuroscience meta-analyses. Histogram of median study power calculated for each of the n = 49 meta-analyses, with the number of meta-analyses on the left axis and the percentage of meta-analyses on the right axis. There is a clear bimodal distribution; n = 15 (31%) of the meta-analyses …]
Source: Button, K. S. et al. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14, 1-12.
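The PPV argument can be made concrete with the standard formula PPV = power × R / (power × R + α), where R is the pre-study odds that a probed effect is real. A small sketch (the R = 0.25 prior odds below are my own illustrative assumption, not a value from the paper):

```python
# Positive predictive value (PPV) as used in the power-failure argument:
# PPV = power * R / (power * R + alpha). R = 0.25 is an assumed prior odds.
def ppv(power: float, alpha: float = 0.05, R: float = 0.25) -> float:
    return power * R / (power * R + alpha)

for power in (0.08, 0.31, 0.80):
    print(f"power = {power:.0%} -> PPV = {ppv(power):.0%}")
# At ~8-31% power, a 'significant' finding is much less likely to reflect
# a true effect than the nominal alpha = .05 might suggest.
```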
Fixing the power!
- Powerful designs
- Collaborate (see the sketch below) 16
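What a powerful design costs in participants can be computed directly; a sketch using statsmodels (the d values are the conventional small/medium/large benchmarks, my own choice, not from the slides):

```python
# Sample size per group for 80% power, two-sided two-sample t-test.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.2, 0.5, 0.8):
    n = analysis.solve_power(effect_size=d, power=0.80, alpha=0.05)
    print(f"d = {d}: ~{n:.0f} participants per group")
# d = 0.2 -> ~394 per group, d = 0.5 -> ~64, d = 0.8 -> ~26:
# small effects need big (often collaborative) samples.
```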
File-drawer problem [in the empirical cycle: Observe (literature) → Hypothesize → Predict (Set-up exp.) → Test (collect & analyze data) → Evaluate (present)] 17
Publication bias
When is a study truly "failed"?
[Photo: blackboard in the office of a couple of PhD students] 18
Power intuitions
Marjan Bakker asked 291 psychologists to indicate their:
- typical effect size
- typical sample size
- typical power
"I usually aim for 20-25 subjects per cell of the experimental design, which is typically what it takes to detect a medium effect size with .80 probability."
Actual power = .35
80% of respondents overestimated the power of their studies
Source: Bakker, M., Wicherts, J. M., Hartgerink, C. H. J., & van der Maas, H. L. J. (2016). Researchers' intuitions about power in psychological research. Psychological Science, 27, 1069-1077.
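That .35 figure is easy to check approximately; a minimal sketch assuming a two-sided two-sample t-test with 20 per cell and a medium effect of d = 0.5 (the exact assumptions of Bakker et al. may differ):

```python
# Power of a two-sided two-sample t-test, 20 subjects per cell, d = 0.5.
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower().power(effect_size=0.5, nobs1=20, alpha=0.05)
print(f"actual power: {power:.2f}")   # ~0.34, nowhere near the intended .80
```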
Poor statistical intuitions: researchers are overly optimistic about finding evidence when they are right. Source: Bakker, M., Wicherts, J. M., Hartgerink, C. H. J., & van der Maas, H. L. J. (2016). Researchers' intuitions about power in psychological research. Psychological Science, 27, 1069-1077. 20
Failed study /fāld stuhd-ee/
1. An empirical study in which unforeseen problems occurred during the data collection
2. Colloquial expression used in the sciences before 2018 to denote studies with (disappointing) nonsignificant outcomes that were deemed unpublishable
Source: van Assen, M. A., van Aert, R. C., Nuijten, M. B., & Wicherts, J. M. (2014). Why publishing everything is more effective than selective publishing of statistically significant results. PLOS ONE, 9, e84896.
Overly positive reporting [in the empirical cycle: Observe (literature) → Hypothesize → Predict (Set-up exp.) → Test (collect & analyze data) → Evaluate (present)] 22
Selective outcome reporting
[Diagram: stress measured via three outcomes; physiological measure: p < .05, observed behavior: p < .05, self-report: p > .05] 23
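Even with no real effect, multiple outcomes make a "significant" result likely somewhere. A toy simulation (my own, with three independent outcome measures; correlated outcomes would inflate somewhat less):

```python
# Two groups, NO true effect, three outcomes per person: how often is at
# least one of the three t-tests "significant"?
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sim, n = 10_000, 30
hits = 0
for _ in range(n_sim):
    a = rng.standard_normal((n, 3))         # group 1: three outcome measures
    b = rng.standard_normal((n, 3))         # group 2
    pvals = stats.ttest_ind(a, b).pvalue    # one t-test per outcome
    hits += (pvals < 0.05).any()
print(f"at least one p < .05 in {hits / n_sim:.1%} of null studies")
# Expected ~14% (1 - 0.95**3); reporting only the significant outcome
# hides the other attempts.
```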
Evidence for this in clinical trials. Source: the COMPare Trials project (Ben Goldacre et al.) 24
Errors in the reporting of statistical results p = .06 Source: Bakker, M., & Wicherts, J. M. (2011). The (mis)reporting of statistical results in psychology journals. Behavior Research Methods, 43, 666-678. 25
Source: Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2016). The prevalence of statistical reporting errors in psychology (1985-2013). Behavior Research Methods. 26
Fixing misreporting & selective reporting
- Reporting guidelines: STROBE, PRISMA, ARRIVE, STARD, CARE
- Peer review with checklists
- Statcheck & other tools (see the sketch below) 27
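The core of a statcheck-style tool is just recomputation: derive the p-value from the reported test statistic and degrees of freedom, then compare it with the reported p. A minimal sketch (the real statcheck is an R package that parses full texts; the reported values below are invented examples):

```python
# Recompute a two-sided p-value from a reported "t(28) = 2.20, p < .01".
from scipy import stats

t_val, df, reported_p = 2.20, 28, "< .01"
recomputed = 2 * stats.t.sf(abs(t_val), df)   # two-sided p from t and df
print(f"t({df}) = {t_val}: recomputed p = {recomputed:.3f}, reported p {reported_p}")
# Recomputed p = .036, so "p < .01" would be flagged as inconsistent.
```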
P-Hacking [in the empirical cycle: Observe (literature) → Hypothesize → Predict (Set-up exp.) → Test (collect & analyze data) → Evaluate (present)] 28
P-hacking (flowchart; a simulation sketch follows below)
Planned analysis → p < .05 → Effect! Write paper
If p > .05, try in turn:
- Add 10 cases → p < .05 → Effect!
- Remove outliers (Z > 2) → p < .05 → Effect!
- Redo analysis with adapted dependent var. → p < .05 → Effect!
- Still p > .05 → call this a "failed study" and perform a new study
- Or: misreport the p-value as being < .05 29
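How much such a decision tree inflates the false positive rate can be simulated. The sketch below (my own simplification of the flowchart, covering only the "add 10 cases" and "remove outliers" branches) runs it under a true null:

```python
# p-hacking under a true null: planned test, then add 10 cases, then drop
# |Z| > 2 outliers; count how often at least one step yields p < .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sim, n = 10_000, 20
false_pos = 0
for _ in range(n_sim):
    a, b = rng.standard_normal(n + 10), rng.standard_normal(n + 10)
    ps = [stats.ttest_ind(a[:n], b[:n]).pvalue]          # planned analysis
    ps.append(stats.ttest_ind(a, b).pvalue)              # add 10 cases
    keep_a = np.abs(stats.zscore(a)) < 2                 # remove outliers
    keep_b = np.abs(stats.zscore(b)) < 2
    ps.append(stats.ttest_ind(a[keep_a], b[keep_b]).pvalue)
    false_pos += min(ps) < 0.05
print(f"false positive rate with three tries: {false_pos / n_sim:.1%}")
# Well above the nominal 5% from just two extra researcher degrees of
# freedom; each additional branch inflates it further.
```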
Many ways to analyse the data… imply many ways to reach the stars* (*p < .05) Source: Wicherts et al. (2016). Degrees of freedom in planning, running, analyzing, and reporting psychological studies: A checklist to avoid p-hacking. Frontiers in Psychology, 7, 1832. 30
P-hacking pervasive? 12.5% of articles misreporting p-values, 87.5% of articles using more subtle ways of p-hacking?? 31
Scientists are only human
"This one SHOULD really be higher!"
"If not, my reviewers will kill my paper"
"There MUST be something wrong with this analysis or with these data"
"And I can forget about getting tenure"
"And I cannot buy the house I wanted" 32
Pre-register studies
- Specify hypotheses & analyses in advance
- Publish the pre-registration
- Or publish a Registered Report (RR), in which peer review focuses on the rationale, hypotheses, and methods (and the article is published regardless of the results)
Sources: Chambers, C. D. (2013). Registered Reports: A new publishing initiative at Cortex. Cortex. Wagenmakers et al. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632-638. 33
Pre-registration challenge 34
Big money [in the empirical cycle: Observe (literature) → Hypothesize → Predict (Set-up exp.) → Test (collect & analyze data) → Evaluate (present)] 35
Big money
The business model of publishers is not necessarily in line with the goal of furthering science. Is non-profit publishing the answer?
Robert Maxwell
https://www.theguardian.com/science/2017/jun/27/profitable-business-scientific-publishing-bad-for-science 36
Big money Incentivize the right behaviors 37
Lack of replication
[Diagram: the empirical cycle drawn twice, the second cycle replicating the first: Observe (literature) → Hypothesize → Predict (Set-up exp.) → Test (collect & analyze data) → Evaluate (present)]
Use cross-validation or other holdout sample techniques 38
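A minimal sketch of the holdout idea (a generic illustration, not code from the talk): explore freely on one half of the data, then run the single confirmatory test on the untouched half.

```python
# Exploration/confirmation split: hypothesis-generating on one half,
# one pre-specified confirmatory test on the holdout half.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x, y = rng.standard_normal(200), rng.standard_normal(200)  # full sample

x_explore, x_holdout = x[:100], x[100:]
y_explore, y_holdout = y[:100], y[100:]

# Exploratory half: look at anything, then commit to ONE hypothesis...
r_exp, p_exp = stats.pearsonr(x_explore, y_explore)
print(f"exploratory r = {r_exp:.2f} (hypothesis-generating only)")

# ...and test that single hypothesis once on the untouched holdout half.
r_hold, p_hold = stats.pearsonr(x_holdout, y_holdout)
print(f"confirmatory r = {r_hold:.2f}, p = {p_hold:.3f}")
```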
Problems and how to fix them
- Scientific misconduct → Open data & regulations
- Observer bias & effects → Blinding
- File drawer problem → Power, banning the "failed study"
- HARKing → Pre-registration
- Errors in reporting & selective outcome reporting → Reporting guidelines, reviewer checklists, statcheck
- P-hacking → Pre-registration
- Lack of replication → Incentivize & analyze sensibly 39
"The first principle is that you must not fool yourself, and you are the easiest person to fool." Richard P. Feynman 40
Metaresearch.nl Hilde Augusteijn Marjan Bakker Marcel van Assen Amir Abdol Michele Nuijten Coosje Veldkamp + Esther Maassen Andrea Stoevenbelt Robbie van Aert Linda Dominguez Alvarez Chris Hartgerink Paulette Flore Olmo van den Akker @JelteWicherts 41