Multiple Comparisons and the Known or Potential Error Rate

Journal of Forensic Economics 19(2), 2006, pp. 231-236
© 2007 by the National Association of Forensic Economics

David Tabak*
*NERA Economic Consulting, New York, NY

I. Introduction

Many analyses in forensic economics are based on statistical methods that have a known Type I error rate. What is often overlooked is that if passing one of many tests will suffice, then the likelihood of supporting a claim or an affirmative defense via statistical analysis is larger than the error rate of a single test. This article examines this issue, referred to generally in the Federal Judicial Center's Reference Manual on Scientific Evidence, and takes the further step of tying it to the Daubert criterion of examining the known or potential error rate of an analysis. We show how multiple comparisons affect the known or potential error rate for a statistical analysis used in forensic economics under the assumption that the tests are independent, and examine when one can use statistical or theoretical evidence to test for independence.

II. Error Rates When Performing Multiple Tests

One of the key concerns of an expert witness is being sure that his or her testimony is admissible. The U.S. Supreme Court's decision in Daubert v. Merrell Dow Pharmaceuticals, Inc. (1993) listed a series of factors that should be considered, if appropriate, in examining proposed expert testimony. One of the factors with which experts employing statistical analysis should be most comfortable is the "known or potential error rate" of an analysis. Yet often little thought is given to how this error rate applies not just to a particular statistical analysis but to a broader set of statistical analyses used to provide evidence for or against a proposition.

Consider an expert examining the share of women being laid off at a company undergoing a reduction in force. If the expert finds that the difference between the expected number of women laid off and the actual number is statistically significant at the 5% level, she should be allowed to testify that the result is statistically significant at a commonly accepted level used in economic analyses. Suppose, on the other hand, that the company had 20 divisions and the expert decided to test for discrimination in each division separately and found a result that was statistically significant at the 5% level in one division. This information, without further analysis of the types discussed below, should not be considered evidence of discrimination, as this is the result we should expect if there was no discrimination: when examining 20 observations, one

should be statistically significant at the 5% level purely by chance even if the null hypothesis of no discrimination is true in all 20.

If we define the Supreme Court's phrase "known or potential error rate" to mean the probability of achieving a result that falsely supports a finding of liability if performed by an expert for plaintiffs, or of achieving a result that falsely supports a finding of no liability if performed by an expert for defendants, the expert in the above example cannot say that her overall result has a known or potential error rate of only 5%. In fact, if the divisions are making independent firing decisions, the potential error rate for this analysis is 64.2%, or 1 - 0.95^20, meaning that there was a nearly two-thirds chance that at least one division would have had a result that is statistically significant at the 5% level.

While not often brought up in the field of forensic economics, this reasoning is not esoteric. In fact, the "Reference Guide on Statistics," a part of the Reference Manual on Scientific Evidence published by the Federal Judicial Center and sent to every federal judge in the United States, provides the following definition in its glossary:

    multiple comparison. Making several statistical tests on the same data set. Multiple comparisons complicate the interpretation of a p-value. For example, if 20 divisions of a company are examined, and one division is found to have a disparity "significant" at the 0.05 level, the result is not surprising; indeed, it should be expected under the null hypothesis. Compare p-value; significance test; statistical hypothesis. (p. 166)

This issue can arise in a number of different settings. For example, an expert for plaintiffs could be examining discrimination against numerous protected classes in a labor case or could be looking at the effects of numerous news releases on a stock price in a securities fraud case.
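The 1 - 0.95^20 arithmetic above can be checked with a short calculation. The following Python sketch (the function name is ours) computes the chance that at least one of N independent tests rejects a true null hypothesis:

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of n_tests independent tests is
    'significant' at level alpha even though every null is true:
    1 - (1 - alpha)^N."""
    return 1 - (1 - alpha) ** n_tests

# A single test has the nominal 5% error rate.
print(round(familywise_error_rate(1), 3))   # 0.05
# Twenty independent divisions, as in the layoff example:
print(round(familywise_error_rate(20), 3))  # 0.642
```

With 20 independent divisions, the 5%-per-test error rate compounds to the 64.2% figure cited in the text.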
Similarly, an expert for defendants could be pursuing an affirmative defense that a qualifications test for numerous job positions was a business necessity because employees with that qualification were better according to some measure of output or performance evaluation.

Note that this should not be taken to mean that there is a bias whenever many possible tests could have been performed. Instead, the bias occurs when multiple tests are performed and the expert would claim that the result of any one test would support a certain finding.¹ For example, if there was other evidence, such as letters by the person in charge of deciding who was laid off that disparaged women, it would be proper to report the standard level of statistical significance from a single test of whether more women were fired than would be expected if the company did not act in a biased manner. On the other hand, if there is simply a general suspicion that those in charge of deciding whom to fire have some unspecified bias, it would not be proper to examine numerous protected classes and then design statistical testimony around the one(s) where a disparity was present as if the others had not been examined.

¹On the other hand, tests whose results are not reported or used in the ultimate calculation of damages because the results are not statistically significant should still count as part of the total number of tests performed.

Again, in the words of the "Reference Guide on Statistics" in the Reference Manual on Scientific Evidence:²

    Repeated testing complicates the interpretation of significance levels. If enough comparisons are made, random error almost guarantees that some will yield "significant" findings, even when there is no real effect. (p. 127)

III. Responses to Concerns About Multiple Comparisons

There are various ways that multiple comparisons can affect the analyses performed by a forensic economist. First, if an opposing expert has performed multiple tests, it makes sense to calculate the known or potential error rate, where that error rate is defined as the probability that they would find a result in their client's favor even if there was no true effect. For example, if the expert is accepting results at the 5% level, then the chance that at least one of N independent tests would result in a statistically significant result (i.e., the potential error rate) is 1 - 0.95^N. Note that this formula is only strictly correct if the tests are independent, a question that will depend on the context of the analysis as discussed below.³ The final two columns of Table 1 show the known or potential error rate of falsely rejecting the null hypothesis at least once when performing N independent tests (i.e., the probability of finding statistical evidence in favor of some effect in at least one test even if there is no true underlying effect).

Second, consider what tests should be made before going on a fishing expedition. If the evidence supports looking at one particular area over others, it may make sense to consider limiting the analysis to that area or designating that in advance as what is called the "primary endpoint" in the biostatistics literature and then treating the other tests as secondary ones for which a multiple comparison adjustment is necessary.⁴
Third, if multiple tests are necessary, report the adjusted levels of statistical significance for having made the numerous tests. A common adjustment for independent tests is a Sidak adjustment, which is simply the inverse of the 1 - 0.95^N formula given previously and involves setting each individual test's cutoff level of statistical significance to 1 - 0.95^(1/N).⁵ This new cutoff level answers

²Note that the "repeated testing" need not involve sequential testing, as the same issues arise with a set of simultaneous tests on the data.
³As a simple example of why this formula should not be applied to tests that are not independent, consider examining a person's pulse rate on their right and left wrists as part of a lie detector test to see whether there is a statistically significant jump after answering a certain question. The pulse rates should be identical and should really be thought of as a single test (at least as far as considering random fluctuations in pulse rates, though perhaps not in terms of the errors in the two instruments used to measure those rates).
⁴Essentially, this means treating one test as the sole test that was designed to be performed and thus obviating the need for an adjustment for this test. Note that there should be a good reason for the designation that does not come from the data (e.g., looking at the data to see where the biggest effect is and then designating that as the primary endpoint would be improper). Also note that one can designate more than one test as the primary endpoint, but then a multiple comparisons adjustment should be made based on the number of tests so designated.
⁵The Sidak adjustment is an exact adjustment for independent tests. Details on the Sidak and other adjustments may be found in various publications covering multiple comparisons (e.g., Westfall et al., 1999).

the question of what common level of significance for each individual test yields an overall error rate for finding one or more statistically significant results of 5%.⁶

The second column of Table 1 shows the critical values for the z-statistic for an overall test at the 5% level. While the calculation of these critical values may not be easy to describe, the explanation for their growth as N increases follows from, and can help reinforce, the explanation of why some courts use two standard deviations as the cutoff for statistical significance with a single test. Because there will always be some variation in the observed data, any test will find some effect different from zero, though of either sign. The two standard deviation rule is a means of limiting the false positives so that there is only a 5% chance that we falsely state that the statistical evidence supports a finding in those cases where there truly was no effect. If we perform multiple tests, we have more opportunities to "find" evidence in favor of a proposition than if we just perform one test. Adjusting the two standard deviation rule as shown in Table 1 reduces the probability of finding evidence in favor of a proposition in each individual test so that we once again have only a 5% chance of declaring that the statistical evidence supports a finding in favor of the proposition in cases where there actually is no underlying effect.

Table 1
Known or Potential Error Rate³

  N¹     5% Critical z-value²     1-(1-0.05)^N     1-(1-0.025)^N
   1            1.96                   5.0%              2.5%
   2            2.24                   9.8%              4.9%
   3            2.39                  14.3%              7.3%
   4            2.49                  18.5%              9.6%
   5            2.57                  22.6%             11.9%
   6            2.63                  26.5%             14.1%
   7            2.69                  30.2%             16.2%
   8            2.73                  33.7%             18.3%
   9            2.77                  37.0%             20.4%
  10            2.80                  40.1%             22.4%
  15            2.93                  53.7%             31.6%
  20            3.02                  64.2%             39.7%
  25            3.09                  72.3%             46.9%
  30            3.14                  78.5%             53.2%
  50            3.29                  92.3%             71.8%
 100            3.48                  99.4%             92.0%

Notes:
¹N represents the number of independent tests or comparisons being made.
²Critical z-value calculated using a significance level of 1 - 0.975^(1/N), which assumes a two-sided test.
³The "known or potential error rate" is defined as the probability that at least one false positive would be found if the underlying data were truly random and N tests were performed with a given (5% or 2.5%) chance of finding a false positive (Type I error, or alpha) for each test. The 1-(1-0.05)^N column uses the 5% per-test significance level; the 1-(1-0.025)^N column gives the corresponding rate for significant results of a given sign (negative or positive) under the 5% two-sided level.

⁶Of course, the formula can be implemented for any desired cutoff level of statistical significance.
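The Sidak cutoff and the critical values in Table 1 can be reproduced with the standard normal inverse CDF. This is a sketch in Python using only the standard library; the function names are ours:

```python
from statistics import NormalDist

def sidak_cutoff(n_tests: int, overall_alpha: float = 0.05) -> float:
    """Per-test significance cutoff, 1 - (1 - alpha)^(1/N), so that N
    independent tests give an overall error rate of overall_alpha."""
    return 1 - (1 - overall_alpha) ** (1 / n_tests)

def critical_z(n_tests: int) -> float:
    """Two-sided critical z-value per note 2 of Table 1: the per-test
    tail probability is 1 - 0.975^(1/N)."""
    return NormalDist().inv_cdf(0.975 ** (1 / n_tests))

# Each of 20 tests must be significant at roughly the 0.26% level
# for the overall error rate to stay at 5%.
print(round(sidak_cutoff(20), 4))  # 0.0026

# Reproduce a few rows of Table 1: N, critical z, and both error-rate columns.
for n in (1, 5, 10, 20):
    print(f"{n:>3}  z={critical_z(n):.2f}  "
          f"{1 - 0.95 ** n:.1%}  {1 - 0.975 ** n:.1%}")
# z-values match Table 1: 1.96, 2.57, 2.80, 3.02
```

The printed rows agree with the corresponding rows of Table 1 after rounding to the precision shown there.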

IV. Independence and Multiple Tests

As noted above, the standard adjustments for multiple comparisons are most useful when the tests are independent. In considering the issue of independence, one important concern for the forensic economist is that the statistical tests comport with the theory of the case. Thus, a claim that a companywide culture or incentives were in place that disfavored one group of employees may be a reason to start by considering the divisions as a combined entity, while claims that a company allowed division heads to act independently and potentially act on their individual biases may be a reason to treat the divisions as independent. Non-statistical evidence in the case may also lead to a decision to examine only certain divisions, thus affecting the number of comparisons made and the required statistical adjustment. In other types of cases, economic theory may provide a basis for assuming independence or a lack of independence under the null hypothesis (e.g., the efficient markets hypothesis claim that stock price movements are independent over time, as opposed to Bertrand competition suggesting that competitors' prices may move together even in the absence of collusion or price-fixing).

When neither theory nor the particular allegations suggest a clear answer, one may again fall back on our definition of the error rate as the chance of falsely rejecting the null. This generally means that the data are being examined to see if one can reject a null of no discrimination or other illegal activity.⁷ If, under such a null, one would generally expect the different divisions to behave independently with respect to the issue under consideration, then independence can be assumed for the purposes of the multiple comparisons test. If, instead, one has a basis for assuming that the divisions are not independent, one can begin with a joint test to see if there is liability in any division.
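Why independence matters can be illustrated with a small Monte Carlo sketch (ours, not from the article). It contrasts perfectly correlated tests, which behave like the single pulse measurement taken on both wrists in footnote 3, with independent ones:

```python
import random

def simulated_familywise_rate(n_tests: int, correlated: bool,
                              trials: int = 100_000,
                              alpha: float = 0.05) -> float:
    """Fraction of simulated trials in which at least one of n_tests
    true-null tests is 'significant' at level alpha. Under the null each
    p-value is uniform on [0, 1]; if correlated, all n_tests share one
    p-value and so behave as a single test."""
    hits = 0
    for _ in range(trials):
        if correlated:
            p_min = random.random()  # one shared p-value for all tests
        else:
            p_min = min(random.random() for _ in range(n_tests))
        if p_min < alpha:
            hits += 1
    return hits / trials

random.seed(12345)
print(round(simulated_familywise_rate(20, correlated=True), 2))   # ~0.05
print(round(simulated_familywise_rate(20, correlated=False), 2))  # ~0.64
```

With perfect correlation the 20 "tests" reject together only about 5% of the time, so no multiple-comparisons adjustment is needed; with independence the rate climbs to roughly the 64.2% of Table 1.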
However, unless one assumes a perfect correlation of the effect across the different divisions, the joint test cannot prove liability for each.⁸ Further analysis, including the use of multiple comparison procedures, would then generally still be necessary.

⁷Alternatively, one might be examining an affirmative defense that is supported by rejecting the null in some test, such as an employer attempting to show that employees with a certain credential perform better, thereby supporting a claim that the employer should be allowed to discriminate on the basis of that credential.
⁸This is not to say that a joint test such as an F-test might not be useful as a preliminary step in investigating whether any effect exists. See, e.g., Chapter 7.4.7 of the Engineering Statistics Handbook published by the National Institute of Standards and Technology at http://www.itl.nist.gov/div898/handbook/prc/section4/prc47.htm: "The ANOVA uses the F test to determine whether there exists a significant difference among treatment means or interactions. In this sense it is a preliminary test that informs us if we should continue the investigation of the data at hand." A problem arises only when one argues that an effect supported solely by a joint test should support claims or damages calculations for each division.

V. Conclusion

Often, forensic economists are tempted to keep testing data until they find a favorable result. As Ronald Coase has been quoted as saying, "If you torture the data enough, it will confess." What is less recognized, and less quotable, is that if you simply ask different parts of the data the same question, some parts will confess due to random variation unless you adjust your critical values for determining statistical significance. Economic theory and the theory of the case may help determine whether the data should be tested by considering different subsets to be independent or not. If the subsets are assumed to be independent, then testing numerous subsets requires a standard adjustment for multiple comparisons. Moreover, unless the subsets are assumed to be perfectly correlated, some adjustment for multiple comparisons will still be warranted. Finally, when multiple tests are made, researchers should be aware of the effects that these multiple tests have on the known and potential error rates of their own and opposing experts' analyses.

References

Kaye, David H., and David A. Freedman, "Reference Guide on Statistics," in Reference Manual on Scientific Evidence, Second Edition, Federal Judicial Center, 2000.

National Institute of Standards and Technology, NIST/SEMATECH e-Handbook of Statistical Methods, U.S. Commerce Department's Technology Administration, reviewed on November 10, 2006. http://www.itl.nist.gov/div898/handbook/

Westfall, Peter H., Randall D. Tobias, Dror Rom, Russell D. Wolfinger, and Yosef Hochberg, Multiple Comparisons and Multiple Tests Using SAS, SAS Institute Inc., 1999.

Cases

Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579 (1993).