The Laryngoscope
© 2014 The American Laryngological, Rhinological and Otological Society, Inc.

Systematic Review

A Review of Multiple Hypothesis Testing in Otolaryngology Literature

Erin M. Kirkham, MD, MPH; Edward M. Weaver, MD, MPH

Objectives/Hypothesis: Multiple hypothesis testing (or multiple testing) refers to testing more than one hypothesis within a single analysis, and it can inflate the type I error rate (false positives) within a study. The aim of this review was to quantify multiple testing in recent large clinical studies in the otolaryngology literature and to discuss strategies to address this potential problem.

Data Sources: Original clinical research articles with >100 subjects published in 2012 in the four general otolaryngology journals with the highest Journal Citation Reports 5-year impact factors.

Review Methods: Articles were reviewed to determine whether the authors tested greater than five hypotheses in at least one family of inferences. For the articles meeting this criterion for multiple testing, type I error rates were calculated, and statistical correction was applied to the reported results.

Results: Of the 195 original clinical research articles reviewed, 72% met the criterion for multiple testing. Within these studies, there was a mean 41% chance of a type I error and, on average, 18% of significant results were likely to be false positives. After the Bonferroni correction was applied, only 57% of significant results reported within the articles remained significant.

Conclusions: Multiple testing is common in recent large clinical studies in otolaryngology and deserves closer attention from researchers, reviewers, and editors. Strategies for adjusting for multiple testing are discussed.

Key Words: Multiple testing, multiple comparisons, clinical research methodology, study design, database, Bonferroni, Bonferroni–Holm, family-wise error rate, error rate per comparison, percent error rate.
Level of Evidence: NA

Laryngoscope, 125:599–603, 2015

From the Department of Otolaryngology/Head & Neck Surgery (E.M.K., E.M.W.), University of Washington, Seattle, Washington; and Surgery Service, Department of Veterans Affairs Medical Center, Seattle, Washington (E.M.W.), U.S.A.

Editor's Note: This Manuscript was accepted for publication July 7, 2014.

This work was conducted at the University of Washington and Veterans Affairs Medical Center, Seattle, Washington. This material is the result of work supported by resources from the Veterans Affairs Puget Sound Health Care System, Seattle, Washington, and from the National Institutes of Health (T32 DC000018). The authors have no other funding, financial relationships, or conflicts of interest to disclose.

Send correspondence to Erin M. Kirkham, MD, MPH, Department of Otolaryngology/Head & Neck Surgery, University of Washington, 1959 NE Pacific Street, Box 356515, Seattle, WA 98195-6515. E-mail: ekirkham@uw.edu

DOI: 10.1002/lary.24857

INTRODUCTION

The P value is often considered to be the statistical litmus test for the significance of an experimental result. In actuality, the P value for any statistical test is the probability of being wrong in rejecting the null hypothesis and thus in accepting the alternative (i.e., original) hypothesis. In other words, it is the probability of a type I error, or obtaining a false-positive result. If one is testing for the difference in a characteristic between two groups, the null hypothesis states that there is no difference, whereas the alternative hypothesis states that there is a difference. If the P value of the test is .05, then even though the difference appears significant, there is still a 5% chance that the null hypothesis is true and the alternative hypothesis is false, meaning the difference between groups is due to chance alone.
It is standard for researchers to accept a 5% chance of obtaining a false positive (type I error) when conducting a statistical test, which corresponds to a significance level of .05. The meaning of the P value changes, however, when multiple hypotheses are tested within a single analysis, frequently referred to as a family of inferences.[1] This concept is demonstrated with the probabilities of rolling dice. When rolling one die, the chance of a six is 1/6, or 17%. When 10 dice are rolled, the chance of at least one landing on six is 84%. Similarly, when multiple hypotheses are tested, each at a significance level of .05, the chance of obtaining at least one false positive rises precipitously with the number of hypotheses tested. This probability is called the family-wise error rate and is calculated by the formula 1 − (1 − α)^c, where α is the significance level (constant across all tests) and c denotes the number of simultaneous comparisons (or hypothesis tests).[2] For five simultaneous tests, the chance of a single
false positive is 23%, which is much greater than the commonly accepted 5%.

Multiple testing occurs more often in some types of studies than in others. Meta-analyses that pool large amounts of data from multiple studies, and clinical trials with factorial designs and multiple end points, are both prone to it.[3,4] However, multiple testing is most often found in exploratory analyses of large databases, as data collected on numerous variables lend themselves to conducting multiple simultaneous tests using the same sample.[1] This does not mean that studies of this type cannot provide meaningful results, as there are many methodological approaches to account for multiple testing. In the design phase of a study, one can predesignate a small number of primary variables of interest while acknowledging that the remaining variables are exploratory. Exploratory findings are considered hypothesis generating and then are tested separately in an independent sample of subjects to produce acceptable support to guide clinical decision making.[2] Confirmation can be done in a follow-up study or in a separate segment of the study population not included in the exploratory analysis. One can also consider the consistency of results when reporting exploratory findings; if the strength and direction of association are consistent among similar variables, it lends credence to the results.

Drawing accurate inference directly from an exploratory study can be done but requires a statistical correction to the significance level, of which there are many available.[2,5–7] The Bonferroni correction generates a stricter significance cutoff by dividing the original significance level by the number of comparisons. For an analysis testing five hypotheses, each at a significance level of .05, the corrected significance level would be .05/5, or .01. This is the simplest, yet one of the most conservative, corrections.
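As an illustration of the formula above, the family-wise error rate and the Bonferroni cutoff can be computed in a few lines. This is a sketch for the reader, not code from the studies reviewed:

```python
# Illustrative sketch of the family-wise error rate and the
# Bonferroni correction, using the numbers discussed in the text.

def family_wise_error_rate(alpha: float, c: int) -> float:
    """Probability of at least one false positive among c independent
    tests, each run at significance level alpha: 1 - (1 - alpha)^c."""
    return 1 - (1 - alpha) ** c

# Dice analogy: chance of at least one six in 10 rolls (~84%).
print(round(family_wise_error_rate(1 / 6, 10), 2))   # 0.84

# Five tests at alpha = .05 -> ~23% chance of a false positive.
print(round(family_wise_error_rate(0.05, 5), 2))     # 0.23

# Bonferroni: divide the significance level by the number of tests.
bonferroni_alpha = 0.05 / 5
print(bonferroni_alpha)                              # 0.01
```

The same calculation shows why the family-wise error rate "rises precipitously": with 20 tests at .05, the probability of at least one false positive already exceeds 64%.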
The Bonferroni–Holm method results in a less conservative significance correction, decreasing the chance of a false-negative result relative to the Bonferroni correction.[8] In this method, the P values are ranked from largest to smallest, and then each is compared to a significance level that is equal to .05 divided by its rank. For example, in a family of inferences with five simultaneous hypotheses in which .005 is the smallest P value (rank = 5), the adjusted significance level would be .05/5 = .01, and the test would remain significant. If the next smallest P value (rank = 4) was .04, then it would no longer be significant when compared to the adjusted significance level of .05/4 = .0125. These are just two examples of the many methods available for correction of the significance level.

Multiple testing has yet to be examined systematically in the otolaryngology literature. The aim of the current study was to quantify multiple testing within recent large clinical studies in otolaryngology, to describe methods used for correction, and to present a review of methodological approaches to avoiding false positives when multiple hypotheses are tested. The authors hypothesize that, similar to past findings in other fields,[9,10] uncorrected multiple testing is common in large clinical studies in the otolaryngology literature, and that uncorrected multiple testing and resulting high error rates will be associated with retrospective study design, large numbers of subjects, and use of large databases.

Fig. 1. Literature search and review. MHT = multiple hypothesis testing.

METHODS

Article Selection

The literature search included all articles published in 2012 in the four largest-volume U.S. general otolaryngology journals with the highest Journal Citation Reports (Thomson Reuters, Philadelphia, PA) 5-year impact factors.[11]
These four publications were The Laryngoscope; Archives of Otolaryngology–Head & Neck Surgery (now called JAMA Otolaryngology–Head & Neck Surgery); Otolaryngology–Head and Neck Surgery; and Annals of Otology, Rhinology, and Laryngology. The search strategy and results are depicted in Figure 1. We included all articles with abstracts on original human subjects research with samples of >100 subjects; we excluded descriptive studies (no hypothesis testing), genomic studies, meta-analyses, and cost-effectiveness analyses due to their use of unique statistical methods. We included 195 articles in our analysis (Fig. 1).

Data Collection

For each article, the source journal; number of subjects; study design (retrospective, prospective, or cross-sectional); and
use of a national, statewide, or multi-institutional database were recorded. The number of distinct families of hypothesis tests was determined for each article. In cases where the distinction between families of tests was not clear, the determination was made so as to minimize the number of hypotheses tested per analysis and thus minimize the incidence of multiple testing (conservative bias). For example, an article testing the treatment effect of an intervention as measured by four different outcomes separately in men and women has eight total hypotheses being tested. These eight hypotheses can be grouped into two families of four hypotheses each, so two distinct families of four hypothesis tests were recorded. However, in an article testing whether eight risk factors were associated with a particular disease, this was considered one family of eight hypothesis tests.

For each distinct family of hypothesis tests, we recorded the number of total hypotheses tested, the significance level(s), and each significant P value. In cases where the number of total hypotheses tested was not explicitly stated, an estimate was made based on the description of variables tested within the methods. The articles were also examined for any method of accounting for multiple testing. In addition, articles were examined for any indication that more tests were conducted than were reported. This was inferred if the methods section listed a greater number of variables tested than were reported in the results, or if it appeared that the authors subdivided a continuous variable into multiple sets of categorical variables for hypothesis testing.

Multiple testing was defined as testing five or more simultaneous hypotheses within a family of inferences because the family-wise error rate for a standard significance level of 0.05 rises to >20% with at least five simultaneous hypotheses tested.
Analysis

Descriptive statistics of the articles, analyses, and hypotheses were reported as mean ± standard deviation, median, and range. The following statistics were calculated for each article with multiple testing.[12] Each error rate calculation assumes that the tests within a family are independent of one another. These formulae quantify the risk of false positives in three different ways: probability, number, and percentage.

Family-wise error rate: the probability of making at least one type I error (false positive) in a family of inferences. Family-wise error rate = 1 − (1 − α)^c, where α is the significance level (constant across all tests) and c is the number of comparisons (or hypotheses tested). For an analysis testing five hypotheses each at a significance level of .05, this would be 1 − (1 − 0.05)^5 = .23.

Error rate per experiment: the expected number of type I errors (false positives) in a particular group of statistical significance tests. Error rate per experiment = c(α), where c is the number of comparisons and α is the significance level (constant across all tests). For an analysis testing five hypotheses each at a significance level of .05, this would be 5(.05) = .25.

Percent error rate: the percentage of results labeled as statistically significant that are likely to be chance results. Percent error rate = 100cα/M, where c is the total number of comparisons, α is the significance level for a set of comparisons, and M is the number of statistical tests with P values less than the designated significance level. For an analysis testing five hypotheses each at a significance level of .05, with three out of five significant, this would be 100(5)(.05)/3 = 8.3%.

TABLE I. Characteristics of Articles (N = 195).
Characteristic                       Mean ± SD              Median    Range
Subjects                             869,636 ± 7,668,879    281       101–98,800,000
Families of tests (per article)      2.8 ± 2.0              2         1–11
Hypotheses (per family of tests)     9.1 ± 10.2             6         1–141
SD = standard deviation.
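The three error-rate formulae defined above can be rendered in code; this is an illustrative Python sketch of the published formulae (the study's own analysis used Stata), reproducing the worked five-hypothesis example:

```python
# Sketch of the three error-rate formulas described in the Analysis
# section. All three assume independent tests, as the article notes.

def family_wise_error_rate(alpha: float, c: int) -> float:
    # Probability of at least one false positive: 1 - (1 - alpha)^c.
    return 1 - (1 - alpha) ** c

def error_rate_per_experiment(alpha: float, c: int) -> float:
    # Expected number of false positives among c tests: c * alpha.
    return c * alpha

def percent_error_rate(alpha: float, c: int, m: int) -> float:
    # Percentage of "significant" results likely due to chance:
    # 100 * c * alpha / M, where M is the count of significant tests.
    return 100 * c * alpha / m

# Worked example from the text: five tests at alpha = .05,
# three of which reached significance.
print(round(family_wise_error_rate(0.05, 5), 2))   # 0.23
print(round(error_rate_per_experiment(0.05, 5), 2))  # 0.25
print(round(percent_error_rate(0.05, 5, 3), 1))    # 8.3
```

Note that the percent error rate can exceed 100% when the number of significant results M is small relative to c·α, which is consistent with the upper range reported in Table II.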
Families of tests that addressed in whole or in part the stated aim of the article were considered primary, and those that did not were considered secondary. For each primary family of inferences, the total number of comparisons and all significant P values were recorded. Both the Bonferroni and the Bonferroni–Holm corrections were applied to the significant P values within each family of inferences to generate corrected results. In cases where P values were reported as less than a cutoff value (e.g., P < .01), the P value was recorded below that value at the next significant digit (e.g., P = .009). In cases where P values were tied, they were assigned the same rank at the highest rank possible to apply the most conservative Bonferroni–Holm correction.

The association between study characteristics and the presence of multiple testing was evaluated with the χ² test and Spearman correlation using Stata 12 (StataCorp, College Station, TX). The Kruskal–Wallis test was used to evaluate whether any of the three calculated error rates varied significantly by journal or study design, and the Wilcoxon rank sum test was used to compare error rates by use of a database. We applied the Bonferroni–Holm correction to generate a corrected significance level (corrected α = .05/rank).

RESULTS

Sample Description

Of the 195 articles that conducted at least one statistical test, 140 (72%) met our criterion for multiple testing in that they tested five or more simultaneous hypotheses in at least one family of tests (Fig. 1). A detailed description of study characteristics is shown in Table I. The collective number of families of tests within the articles reviewed was 554; 367 (66%) of those were made up of five or more hypothesis tests. For six articles, the number of hypotheses tested was not explicitly stated, but each was estimated from the list of tested variables described in the methods.
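The Bonferroni–Holm application described in the statistical analysis above (rank-based thresholds, with tied P values sharing the highest, i.e., strictest, rank) can be sketched as follows. This is one possible implementation consistent with that description, not the authors' code; the step-down stop rule is the standard Holm procedure:

```python
# Illustrative Bonferroni-Holm step-down correction: P values are
# ranked from largest to smallest, so the smallest P value in a
# family of c tests is compared to alpha/c, the next to alpha/(c-1),
# and so on. Standard Holm stop rule: once one test fails, all
# larger P values are also declared non-significant.

def holm_significant(p_values, alpha=0.05):
    """Return {index: is_significant} for one family of inferences."""
    c = len(p_values)
    order = sorted(range(c), key=lambda i: p_values[i])  # smallest first
    # Rank c for the smallest P value, c - 1 for the next, etc.
    rank_by_index = {i: c - pos for pos, i in enumerate(order)}
    # Tied P values share the highest rank among them, which yields
    # the strictest (most conservative) threshold alpha/rank.
    for i in order:
        tied = [j for j in order if p_values[j] == p_values[i]]
        rank_by_index[i] = max(rank_by_index[j] for j in tied)
    significant = {}
    failed = False
    for i in order:  # step down from the smallest P value
        if failed or p_values[i] > alpha / rank_by_index[i]:
            failed = True
        significant[i] = not failed
    return significant

# Worked example from the Introduction: smallest P = .005 passes
# (.005 < .05/5 = .01); next smallest P = .04 fails (.04 > .05/4 =
# .0125), so it and all larger P values are non-significant.
print(holm_significant([0.005, 0.04, 0.20, 0.50, 0.80]))
# {0: True, 1: False, 2: False, 3: False, 4: False}
```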
Within 21 of the 195 articles (11%), there was evidence that the authors may have tested more hypotheses than they reported in the results. Of the 140 articles testing five or more hypotheses, 14 (10%) in some way corrected or accounted for multiple testing, whereas 126 (90%) did not. Of the 14 articles that addressed the problem of multiple testing, eight applied a statistical correction, one conducted an independent validation of results, one specified primary versus secondary outcomes, one claimed to be exploratory only, and three simply mentioned multiple testing and/or the need for further validation. Of the eight using statistical correction, five used the Bonferroni method, one
the Tukey–Kramer method, and one the False Discovery Rate method, whereas one chose a decreased significance level of .005 without citing a method of correction.

Error Rate Calculations

The mean, median, and range for each error rate calculation are shown in Table II. The mean probability of obtaining at least one false positive in a family of inferences (family-wise error rate) was 0.41 ± 0.17. The mean expected number of errors in a family of inferences (error rate per experiment) was 0.61 ± 0.78. The mean percentage of significant tests likely to be false positives (percentage error rate) was 18% ± 29%.

TABLE II. Error Rates for Articles With Unaccounted Multiple Testing (N = 126).
Characteristic              Mean ± SD      Median   Range
Family-wise error rate      0.41 ± 0.17    0.37     0–1
Error rate per experiment   0.61 ± 0.78    0.45     0–11.3
Error rate, %               18 ± 29        12       1–141
SD = standard deviation.

The Correction Experiment

Of the 554 total analyses recorded, 90% were considered primary, in that they addressed in whole or in part the stated aim of the article. One hundred fourteen articles had unaccounted multiple testing among their primary analyses. Our application of statistical correction affected the results of 100 of the 114 articles (88%) that tested five or more hypotheses, meaning it changed whether or not at least one P value met statistical significance. Among all families of tests with multiple testing, there were 1,509 total significant P values. After applying the Bonferroni correction, 860 (57%) of these P values remained significant. After applying the Bonferroni–Holm correction, 948 (63%) of these P values remained significant.

Associations With Multiple Testing and Error Rates

The proportion of articles with multiple testing did not vary significantly by journal (P = .95). Forty-six (24%) articles utilized a national database, four (2%) articles used a statewide or multi-institutional database, and 145 (74%) articles used data from a single institution. There was no association between the presence of multiple testing and whether or not data were drawn from a database (P = .36). Over half of the studies reviewed were retrospective (n = 110, 56%), 38 (20%) were cross-sectional, 21 (11%) were prospective, and the remainder were classified as other (n = 26, 13%). There was no association between multiple testing and study design (P = .30). There was a small positive correlation between number of subjects and number of hypotheses tested (Spearman ρ = 0.08, P = .048); however, this correlation was not statistically significant when correcting for multiple testing (Bonferroni–Holm corrected significance level α = .05/4 = .0125).

For the articles that employed multiple testing without correction, neither the family-wise error rate nor the error rate per experiment differed significantly by journal, use of a large database, or study design (all P > .30). Furthermore, the percentage error rate did not differ significantly by journal (P = .63) or study design (P = .26). However, the percentage error rate was significantly lower in studies using multi-institutional or national databases compared with those that did not (P < .0001), and it remained significant after applying the Bonferroni–Holm correction (α = .016).

DISCUSSION

Seventy-two percent of the articles reviewed met our criterion for multiple testing, indicating that multiple testing is common among recent large clinical studies published in high-impact otolaryngology journals. This finding is also reflected in the error rates calculated from the studies reviewed. The overall mean probability of obtaining at least one false positive was 41%, and an average of 18% of P values labeled as significant may be chance results.
Our results did not reveal a significant predictor of multiple testing, such as retrospective versus prospective or large-database versus single-institution studies. The lower percentage error rate in large-database studies may reflect the greater statistical power in these studies (i.e., greater power to detect a larger number of true positives, leaving a smaller proportion of false to true positives).

Multiple testing is not a new challenge in research, and it is not unique to otolaryngology. A recent review of multiple testing in ophthalmology research found that 23% of abstracts presented at a large meeting reported greater than five hypothesis tests, with a minority (14%) applying statistical correction.[9] Multiple testing was also examined in 173 articles identified from five randomly selected issues of the American Journal of Public Health and the American Journal of Epidemiology published in 1996. Within these articles, the authors found an average false-positive rate substantially higher than the traditionally assumed level of 5% (P < .05).[12] Limiting our search to clinical articles with large subject numbers likely accounts for the higher percentage of multiple testing seen in this study versus that found in the ophthalmology literature, and does not suggest that multiple testing is more common in otolaryngology than in other fields.

A minority (10%) of studies that employed multiple testing accounted for it, doing so by various methods. Some acknowledged exploratory findings, prespecified primary hypotheses, or conducted independent validations. Eight of the fourteen articles that accounted for multiple testing did so with statistical correction. There is a wide array of statistical approaches for controlling the family-wise error rate, but they can generally be broken up into two categories: single-step and stepwise.[13] Single-step procedures such as the Bonferroni and Sidak
corrections apply equal correction to all P values.[14] They provide strong control of the type I error rate but reduce statistical power, increasing the type II error rate (false negatives).[5] They may not be optimal for all studies, especially those with smaller sample sizes. Stepwise procedures were subsequently developed to maintain greater statistical power while controlling the type I error rate. In these procedures, hypotheses are evaluated successively, and rejection of one hypothesis depends on the outcome of other hypothesis tests. The Bonferroni–Holm method is a recommended method due to its ease of calculation. Others have been described by Westfall and Young,[15] Shaffer,[16] and Simes,[17] to cite just a few. Although these methods assume the independence of the hypotheses tested, Dudoit et al.[13] have provided a well-written review of methods that can be considered for large-scale testing with correlated hypotheses, such as those in genomic studies. Ultimately, there is inherent subjectivity in determining what constitutes a family of inferences and in choosing the best adjustment. The key step is recognizing the challenges inherent in multiple testing and addressing them during the design and analytic phases of a study.

This study has several limitations. A comprehensive search for every article with multiple testing within even a single journal was beyond the resources available for this study. Therefore, our findings do not represent the overall prevalence of multiple testing within the otolaryngology literature. The choice to review clinical articles with >100 subjects likely selected those in which multiple testing was common. In addition, there was inherent subjectivity in selecting the criterion for multiple testing and in delineating distinct families of tests. Our choice of a cutoff of five or more hypotheses to define multiple testing selected articles with a substantial risk of false-positive results (23% or greater).
A lower cutoff may have resulted in a greater percentage of multiple-testing articles among those reviewed. In addition, the low threshold used to designate distinct families of tests minimized both the number of tests per family and the likelihood of multiple testing within each family. Therefore, although we selected articles prone to multiple testing, our methods resulted in a conservative estimate of multiple testing within those articles. It is important to note that the assumption of independence of tests may not hold true for all articles included in our error-rate calculations. For example, when testing for risk factors for a disease, some (e.g., tobacco and alcohol use) may be correlated and thus not independent. The error rates are lower for analyses in which tests are correlated; therefore, the calculations obtained in our analysis should be considered an upper bound of the actual error rates for the studies reviewed.

CONCLUSION

Multiple testing is common in recent large clinical studies in otolaryngology, suggesting that this issue deserves closer attention from researchers, reviewers, and editors. In addition, given the high family-wise error rate found among the articles analyzed, the reader should keep in mind the possibility that some findings may be due to statistical error and may benefit from independent validation. There are several methodological and statistical approaches to address multiple testing to help ensure that one's clinical decisions are based on reliable evidence.

BIBLIOGRAPHY
1. Popper Schaffer J. Multiple hypothesis testing. Ann Rev Psychol 1995;46:561–584.
2. Bender R, Lange S. Adjusting for multiple testing. J Clin Epidemiol 2001;54:343–349.
3. Imberger G, Vejlby AD, Hansen SB, Moller AM, Wetterslev J. Statistical multiplicity in systematic reviews of anaesthesia interventions: a quantification and comparison between Cochrane and non-Cochrane reviews. PLoS One 2011;6:e28422.
4. O'Brien PC, Fleming TR.
A multiple testing procedure for clinical trials. Biometrics 1979;35:549–556.
5. Aickin M, Gensler H. Adjusting for multiple testing when reporting research results: the Bonferroni vs Holm methods. Am J Public Health 1996;86:726–728.
6. Gordi T, Khamis H. Simple solution to a common statistical problem: interpreting multiple tests. Clin Ther 2004;26:780–786.
7. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B 1995;57:289–300.
8. Holm S. A simple sequentially rejective multiple test procedure. Scand J Stat 1979;6:65–70.
9. Stacey AW, Pouly S, Czyz CN. An analysis of the use of multiple comparison corrections in ophthalmology research. Invest Ophthalmol Vis Sci 2012;53:1830–1834.
10. Sainani KL. The problem of multiple testing. PM R 2009;1:1098–1103.
11. Journal Citation Reports. May 22, 2012. Available at: http://admin-apps.webofknowledge.com/jcr/help/h_info.htm. Accessed June 6, 2014.
12. Ottenbacher KJ. Quantitative evaluation of multiplicity in epidemiology and public health research. Am J Epidemiol 1998;147:615–619.
13. Dudoit S, Popper Schaffer J, Boldrick J. Multiple hypothesis testing in microarray experiments. Statist Sci 2003;18:71–103.
14. Abdi H. Bonferroni and Sidak corrections for multiple comparisons. In: Salkind NJ, ed. Encyclopedia of Measurement and Statistics. Thousand Oaks, CA: Sage; 2007.
15. Westfall PH, Young SS. Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment. New York, NY: John Wiley & Sons; 1993.
16. Shaffer JP. Modified sequentially rejective multiple test procedures. J Am Stat Assoc 1986;81:826–831.
17. Simes RJ. An improved Bonferroni procedure for multiple tests of significance. Biometrika 1986;73:751–754.